Introduction


Over  the  past  decade,  genetic  changes  associated  with  recurrent  chromosome  breakpoints  have 
been  discovered  in  human  malignancies,  predominantly  of  haematologic  origin.  The 
characterization  of  these  alterations  has  demonstrated  that  these  changes  can  be  disease  specific 
and  can  have  functional  consequences  (e.g.  Philadelphia  chromosome  and  CML).  The  rationale 
underlying  this  proposal  is  that  functional  recurrent  breakpoint-associated  chromosomal 
changes  occur  during  breast  cancer  progression  and  that  their  discovery  and  characterization 
may  lead  to  novel  diagnostic  and  therapeutic  tools  for  breast  cancer  patients. 

By combining studies in cell lines and breast tumors, we can identify important genomic
rearrangements that may drive breast cancer tumorigenesis. With these studies we aim to
develop new screening methods for more efficient, informative, and cost-effective
discovery of recurrent genomic rearrangements in breast cancer. By using unbiased approaches
to discover aberrant breakpoints in the breast cancer genome, we greatly increase the chance
of detecting rare but important alterations and their recurrence in different breast cancer
genomes. By studying the biology of such an aberrant breakpoint, it might be feasible to
discover new ways of targeting these specific alterations at the protein level with new
compounds. Taken together, this study aims to determine, for the first time, recurrent and
biologically relevant genomic changes in breast cancers.

Body 

Task  1:  Determine  the  recurrence  rate  of  the  breakpoints  in  breast  cancer  cell  lines. 

To determine the recurrence rate of the genomic breakpoints in breast cancer cell lines, I created
two pools of breast cancer cell line DNA. One pool contained 9 cell lines and the other 7, so in
total 16 cell lines were tested for the presence of the 398 breakpoints. Bands were present in about
30% of the PCR products. These PCRs were performed before we analyzed the breakpoints in depth in
MCF-7. When we sequenced the breakpoints in MCF-7/BAC, we discovered that many breaks had been
induced during the creation of the BAC library. Other breakpoints were eliminated because they
could not be validated on BAC or MCF-7 cell line DNA, because they were present in the normal
population, or because they were redundant.

We discovered 157 genomic breakpoints in MCF-7 cells that could be confirmed by PCR across
breakpoint joins as likely somatic mutations, of which only some have been previously described.
A total of 79 genes are involved in rearrangement events, including 10 fusions of coding exons
from different genes and 77 other aberrant breakpoints involving known or predicted genes. Among
the breakpoints that involved genes, we first focused on the 10 gene fusions predicted to lead to
fusion transcripts (see Table 1).

Table 1: Gene fusions discovered in the MCF-7 breast cancer cell line

Fusion genes | Type | Cytobands | Description
ARFGEF2:SULF2 | Intra-Chr Inversion | 20q13.13;20q13.13 | Fusion of ARFGEF2 exon 1 to SULF2 exons 3-21, 1.2 Mb inversion
DEPDC1B:ELOVL7 | Intra-Chr Inversion | 5q12.1;5q12.1 | Fusion of DEPDC1B exons 1-7 (out of 11) with ELOVL7 exons 8-9, 127 Kb inversion
RAD51C:ATXN7 | Inter-Chr Rearrangement | 3p14.1;17q22 | Fusion of RAD51C N-terminus exons 1-7 (out of 9) with ATXN7 exons 6-13
SULF2:PRICKLE2 | Inter-Chr Rearrangement | 3p14.1;20q13.13 | Fusion of SULF2 exon 1 with last exon of PRICKLE2
NPEPPS:USP32 | Intra-Chr Inversion | 17q21.32;17q23.2 | Fusion of NPEPPS exons 1-12 (out of 23) with USP32 exons 2-34, 13 Mb inversion
ASTN2:PTPRG | Inter-Chr Rearrangement | 3p14.2;9q33.1 | Fusion of ASTN2 exons 1-10 (out of 22) with PTPRG exons 3-30
BCAS3:BCAS4 | Inter-Chr Rearrangement | 17q23.2;20q13.13 | BCAS4 exon 1 fused to BCAS3 exons 23-24, Ruan et al. (Genome Res 17:828-838)
BCAS3:RSBN1 | Inter-Chr Rearrangement | 1p13.2;17q23.2 | Fusion of RSBN1 first exon with BCAS3 exons 6-24
ASTN2:TBC1D16 | Inter-Chr Rearrangement | 9q33.1;17q25.3 | Fusion of ASTN2 exons 1-15 with TBC1D16 exons 2-12
BCAS4:PRKCBP1 | Intra-Chr Inversion | 20q13.12;20q13.13 | Fusion of BCAS4 exon 1 with PRKCBP1 exons 5-22, 3.5 Mb inversion

For a gene fusion to have functional significance, it needs to produce an aberrant protein whose
function differs from that of the wildtype proteins. The first step in identifying such fusions is
to determine whether they produce a chimeric mRNA. To determine whether the predicted chimeric
mRNA transcripts were created by these genomic fusions, I performed gene-specific RT-PCR on MCF-7
and 2 normal controls. Out of ten DNA fusions, four showed a fusion mRNA transcript specifically
in MCF-7 by RT-PCR (Figure 1).

Three of these are newly identified (ARFGEF2/SULF2, DEPDC1B/ELOVL7, RAD51C/ATXN7), and one has
been previously described (BCAS4/BCAS3).

Figure 1: Discovery of chimeric mRNA product in MCF-7. Panels: BP 8, ARFGEF2-SULF2 fusion; BP 35,
RAD51C-ATXN7 fusion; BP 80, BCAS4-BCAS3 fusion; BP 33, DEPDC1B-ELOVL7 fusion. RNA was isolated
from MCF-7 cells, and RT-PCR was performed for control regions and the fusion. These data clearly
show the presence of the wildtype transcript in MCF-7 and the two control RNAs (from MCF-10A and
normal breast), while the fusion transcript is only present in MCF-7.

If a genomic fusion is present in other breast cancer cell lines, it is not very likely that it
will occur at exactly the same position. Even if the break occurs several kilobases upstream or
downstream of the breakpoint originally discovered in MCF-7, it might still create the same
downstream consequence by producing the same chimeric mRNA. Because of this, I tested for the
presence of the 4 different chimeric mRNAs in 16 breast cancer cell lines. The 16 breast cancer
cell lines were divided into 4 pools of 4 cell lines. Pools 1 and 2 showed a band after RT-PCR for
the RAD51C/ATXN7 fusion (Figure 2). Sequencing of the PCR product confirmed the presence of the
fusion. After deconvoluting the pools, I discovered that the RAD51C/ATXN7 fusion was present in 2
other breast cancer cell lines, T47D and MDA MB361. The fusion of RAD51C and ATXN7 most likely
results in loss of a critical C-terminal domain of RAD51C. By western blot I was able to detect a
shorter band in MCF-7 and MDA MB361. This was confirmed by immunoprecipitating RAD51C with a
specific antibody and probing the western blot with a different RAD51C antibody (Figure 3).

I performed preliminary functional studies for the ARFGEF2/SULF2 fusion, which are reported under
Task 4.
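
The pooled RT-PCR screen for the fusion transcript (4 pools of 4 cell lines, followed by deconvolution of the positive pools) can be sketched as a two-stage test. This is only an illustration of the logic: the pool layout, the placeholder names such as `line_A`, and the `mock_rt_pcr` function are hypothetical stand-ins, since the report names only T47D and MDA MB361 as carriers and does not list pool membership.

```python
from typing import Callable, Dict, List, Set


def deconvolute_pools(
    pools: Dict[str, List[str]],
    assay: Callable[[List[str]], bool],
) -> Set[str]:
    """Two-stage pooled screen: test each pool once, then retest the
    members of every positive pool individually."""
    positives: Set[str] = set()
    for members in pools.values():
        if not assay(members):      # negative pool: all members are cleared
            continue
        for line in members:        # positive pool: test one cell line at a time
            if assay([line]):
                positives.add(line)
    return positives


# Hypothetical pool layout; only T47D and MDA MB361 are named in the report.
pools = {
    "pool1": ["T47D", "line_A", "line_B", "line_C"],
    "pool2": ["MDA MB361", "line_D", "line_E", "line_F"],
    "pool3": ["line_G", "line_H", "line_I", "line_J"],
    "pool4": ["line_K", "line_L", "line_M", "line_N"],
}
carriers = {"T47D", "MDA MB361"}
reactions = []  # records every simulated RT-PCR reaction


def mock_rt_pcr(samples: List[str]) -> bool:
    """Stand-in for the RT-PCR readout: a band appears if any sample carries the fusion."""
    reactions.append(tuple(samples))
    return any(s in carriers for s in samples)


hits = deconvolute_pools(pools, mock_rt_pcr)
```

With two positive pools, this layout needs 4 pool reactions plus 8 single-line reactions, 12 in total, rather than 16 individual reactions.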


Figure 2: Discovery of chimeric mRNA product of the RAD51C/ATXN7 fusion in breast cancer cell
lines. RNA was isolated from 16 breast cancer cell lines, and 4 pools of 4 cell lines were made.
RT-PCR was performed for the fusion. The presence of the fusion transcript was confirmed by
sequencing. Lanes: BCCL pools 1-4 and MCF-7.


Figure 3: Presence of a truncated form of RAD51C in breast cancer cell lines. A)
Immunoprecipitation was performed on cell lysates with a mouse anti-Rad51C antibody. Eluates from
the IP were run on SDS-PAGE and probed with a rabbit anti-Rad51C antibody. B) Repeat of the
experiment in A, with the addition of the negative control MCF-10A.


Another angle we pursued is the evolution of breakpoints. To gain insight into the heterogeneity
of genomic breakpoints in breast cancer cell lines, and to narrow down on breakpoints originating
from the ancestor, I studied the 157 validated breakpoints in seven MCF-7 sub-lines (Figure 4).
Thirty-one of the original 157 breakpoints identified in MCF-7/BAC are present in all MCF-7
sub-lines (Figure 5). The distribution of all 157 breakpoints in the genome shows clusters of
breakpoints on certain chromosomes, as well as many breakpoints randomly distributed throughout
the genome. Among the 31 breakpoints common to all MCF-7 sub-lines, the clusters are retained and
the number of randomly distributed breakpoints is dramatically reduced. A finding of interest is
that breakpoints containing genes are strongly enriched in this shared set (77.4% of the 31 shared
breakpoints vs 50.3% of all 157, p=0.0056) (Figure 6). Even more interesting is that 5 of the 10
fusions are present in all sub-lines (16.1% of the shared breakpoints vs 6.4% of all breakpoints)
(Figure 6). Also, all 4 fusion genes expressing a chimeric mRNA product are present in all MCF-7
sub-lines, indicating that these breakpoints might originate from an ancestral cell line.

Figure 4: Examples of breakpoint analysis by PCR in MCF-7/B, MCF-7/ATCC, MCF-7/Neo, and MCF-7/BK.
Breakpoints were scored on the presence, size and number of PCR bands.

With this information we may get a better understanding of the evolution and heterogeneity of
genomic instability and rearrangements in breast cancer. Also, by narrowing down on breakpoints
that are in all the MCF-7 sub-lines, I get closer to the 'true' breakpoints that originated in the
tumor from which MCF-7 is derived.

Figure 5: Schema of breakpoints present in individual cell lines and the shared breakpoints
between them.

Figure 6: Graphic representation of the distribution of the breakpoints. Panels compare all
discovered breakpoints with the breakpoints present in all sub-lines, for rearranged genes (50.3%
vs 77.4% of breakpoints contain genes) and for gene-gene fusions. The enrichment of
gene-containing breakpoints is statistically significant (p=0.0056, Fisher exact).
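
As a hedged illustration of the enrichment test named in the Figure 6 caption, the sketch below implements a two-sided Fisher exact test using only the standard library. The counts are back-calculated from the reported percentages (24 of 31 shared breakpoints and 79 of 157 total breakpoints contain genes), which is an assumption. The report does not state how the contingency table was built, so this sketch is not expected to reproduce the reported p-value exactly.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_obs = prob(a)
    # A small relative tolerance guards against floating-point ties.
    total = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)


# Counts assumed from the percentages in the text:
# 24/31 = 77.4% (shared by all sub-lines), 79/157 = 50.3% (all breakpoints).
shared_with_genes, shared_total = 24, 31
all_with_genes, all_total = 79, 157

p_value = fisher_exact_two_sided(
    shared_with_genes, shared_total - shared_with_genes,
    all_with_genes, all_total - all_with_genes,
)
```

Because the 31 shared breakpoints are a subset of the 157, a stricter analysis might instead compare the shared set against the remaining 126 breakpoints; the function accepts either table.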


To address the clonal variation within cell lines, I tested for the presence of the 157
breakpoints in a set of MCF-7 clones that were derived from the same parental MCF-7 line,
MCF-7LG. As shown in Figure 7, there is close to 95% concordance in breakpoint content between the
clones. This is an important finding, since it shows that single cells from a population are much
more closely related than we expected, and that their genomes are more stable than previously
speculated.


Figure 7: Heatmap showing that the clones are closely related to the parental cell line MCF-7LG.
Black indicates that a breakpoint is present and white that it is absent in that particular cell
line.
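
The similarity between clones summarized in the heatmap can be expressed as pairwise concordance of presence/absence calls. The sketch below shows one way to compute it. The profiles are invented toy data for 10 breakpoints, not the 157 scored in the report, and the names `parental`, `clone_1` and `clone_2` are hypothetical.

```python
from itertools import combinations
from typing import Dict, List


def concordance(profile_a: List[int], profile_b: List[int]) -> float:
    """Fraction of breakpoints with the same present/absent call in both profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must score the same breakpoints")
    matches = sum(x == y for x, y in zip(profile_a, profile_b))
    return matches / len(profile_a)


# Hypothetical presence (1) / absence (0) calls for 10 breakpoints.
profiles: Dict[str, List[int]] = {
    "parental": [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
    "clone_1":  [1, 1, 1, 0, 1, 1, 0, 1, 1, 0],
    "clone_2":  [1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
}

# Concordance for every pair of cell lines, keyed by the pair of names.
pairwise = {
    (a, b): concordance(profiles[a], profiles[b])
    for a, b in combinations(profiles, 2)
}
```

In this toy data each clone differs from the parental line at one breakpoint and the two clones differ from each other at two.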


Task  2:  Determine  recurrence  rate  of  the  breakpoints  in  a  panel  of  breast  tumors. 


Figure  8:  A:  extraction  of  genomic  DNA  of  breast  cancer  tumors  and  cell  lines.  B:  Example  of 
validation  of  breakpoints  by  PCR  in  HCC1194 


To use genomic DNA for the development of next-generation sequencing techniques, the quality of
the extracted DNA is crucial. I compared extraction protocols to determine the best technique for
obtaining the DNA required for long-range PCR. This DNA was used for the development of new
high-throughput sequencing approaches for the discovery of breakpoints in breast cancer, in this
case SOLiD sequencing. Our intention was to generate a 5 kb mate pair library to increase the
sensitivity of breakpoint discovery in the genome. We encountered considerable difficulty in
generating a library with this insert size, and spent a lot of time and resources optimizing the
library to obtain appropriate coverage of the whole genome. We were able to generate libraries
from DNA extracted from breast cancer tissue and its paired normal control. This was extremely
valuable, since it gave us insight into the background presence of breakpoints and fusions in the
DNA of that individual. We also included three breast cancer cell lines (HCC1954, MDA MB231, and
MDA MB361). The data from our SOLiD sequencing look very promising. We were able to find
recurrences of breakpoints in the breast tumors as well as in the breast cancer cell lines. I
validated the identified breakpoints by PCR, and we had an unexpectedly high validation rate of
42.3-44.8%.


Task  3:  Validate  breakpoints  in  an  independent  set  of  breast  cancer  tumors  and  associate 
breakpoints  with  histo-pathological  and  clinical  characteristics. 

a.  Develop  break-away  FISH  probes  for  the  detection  of  recurrent  breakpoints. 


FISH probes were developed for the RAD51C/ATXN7 fusion. After confirming probe specificity on
control lymphocytes, metaphase spreads of MCF-7 and MCF-10A (negative control) cells were
hybridized with probes proximal and distal to the break in RAD51C, and with a probe for RAD51C
spanning the break (Figure 9). These break-away FISH experiments did not give a conclusive answer,
so we decided to test directly for the presence of the fusion. Probes for RAD51C and ATXN7 were
developed and hybridized on MCF-7 and MCF-10A metaphase spreads. The results clearly show RAD51C
and ATXN7 signals in close proximity in the MCF-7 cells and not in the MCF-10A cells (Figure 10).
With these data we were able to generate a detection tool for the presence of the RAD51C/ATXN7
fusion, and to confirm the presence of the genomic translocation in MCF-7 cells.

Figure 9: Detection of the RAD51C/ATXN7 fusion by FISH. A) Proximal and distal probes were tested
on metaphase spreads of MCF-7 cells. MCF-7 shows clear amplification of RAD51C. B) Break-away FISH
on metaphases of MCF-7 and MCF-10A cells.

Figure 10: Detection of the RAD51C/ATXN7 fusion by FISH. Metaphases of MCF-7 and MCF-10A cells
were hybridized with a RAD51C probe and an ATXN7 probe. Yellow signal in MCF-7 shows
co-localization of RAD51C and ATXN7, indicating a fusion between the genes.
b. Detect recurrent breaks with break-away FISH and associate with histo-pathological and
clinical characteristics.

We were unable to validate the presence of the fusion at the genomic level by FISH in T47D and
MDA MB361. Because of our inability to validate the genomic fusion by FISH, and the fact that
this technique is very low-throughput, we decided to focus our attention and resources on
developing a high-throughput screening technique. We also used these data for the re-discovery
of breakpoints and for the discovery of previously unidentified breakpoints.

Here we were faced with 2 key challenges: (1) sensitive detection of chromosomal aberrations, and
(2) high-throughput validation of putative chromosomal aberrations. To address both challenges, we
developed the fosmid diTag method, the first long-range mate-pair physical mapping method based on
massively parallel sequencing. We applied the fosmid diTag and 5 Kbp Illumina mate pair methods to
the MCF7 and HCC1954 breast cancer cell lines (Figure 11). The rearrangements detected by both
methods show a 3-fold enrichment for cancer-specific somatic mutations compared to those detected
by the 5 Kbp method alone. The fosmid diTag method also reveals a much higher proportion of gene
fusions and truncations.

We first mapped aberrations in MCF7 and HCC1954 using the medium-range Illumina mate pair method.
To detect rearrangements, we searched for regions where at least two independent pairs of ends
showed a discrepancy with their predicted size and/or orientation. We found 23,555 rearrangements
in MCF7 and 3,824 in HCC1954. A total of 18,444 indels were identified in MCF7 and 1,258 in
HCC1954 (Figure 12). The other rearrangement classes consist of pairs with incorrect orientation,
indicative of potential inversions, and pairs whose ends map to different chromosomes, predicting
an interchromosomal translocation. In MCF7, we detected 1,575 inversions and 3,536 translocations.
We found 485 inversions and 2,081 translocations in HCC1954 (Figure 12).

Again we were interested in the validation rate by PCR, and primers spanning the breakpoints were
tested on genomic DNA from breast cancer cell lines and normal controls. The PCR assay produced a
single strong amplification product in 28% and 37% of the reactions for MCF7 and HCC1954,
respectively. We detected and validated a total of 91 somatic rearrangements in MCF7 and 25 in
HCC1954, including genomic alterations corresponding to 50% of the transcript fusions previously
discovered by transcript mapping.
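
The detection rule described above (at least two independent mate pairs that disagree with the expected size and/or orientation) can be sketched as follows. This is a simplified illustration and not the pipeline used in the study: the insert size, tolerance, window size, expected strand pattern, and the `MatePair` record are all assumed for the example.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MatePair:
    chrom1: str
    pos1: int
    strand1: str   # "+" or "-"
    chrom2: str
    pos2: int
    strand2: str


# Assumed library parameters (illustrative, not taken from the report).
EXPECTED_INSERT = 5_000           # nominal mate-pair span in bp
TOLERANCE = 1_000                 # allowed deviation before a pair is discordant
EXPECTED_STRANDS = ("+", "-")     # assumed orientation of a concordant pair
BIN_SIZE = 10_000                 # window used to group pairs into regions
MIN_SUPPORT = 2                   # "at least two independent pairs"


def classify(pair: MatePair) -> str:
    """Label a single mate pair as concordant or by its type of discordance."""
    if pair.chrom1 != pair.chrom2:
        return "translocation"
    if (pair.strand1, pair.strand2) != EXPECTED_STRANDS:
        return "inversion"
    if abs(abs(pair.pos2 - pair.pos1) - EXPECTED_INSERT) > TOLERANCE:
        return "indel"
    return "concordant"


def call_rearrangements(pairs: List[MatePair]) -> Counter:
    """Count candidate rearrangements supported by >= MIN_SUPPORT discordant pairs."""
    support: Dict[Tuple, int] = defaultdict(int)
    for p in pairs:
        kind = classify(p)
        if kind == "concordant":
            continue
        key = (kind, p.chrom1, p.pos1 // BIN_SIZE, p.chrom2, p.pos2 // BIN_SIZE)
        support[key] += 1
    return Counter(key[0] for key, n in support.items() if n >= MIN_SUPPORT)


pairs = [
    MatePair("chr17", 100_000, "+", "chr17", 105_100, "-"),  # concordant
    MatePair("chr17", 200_000, "+", "chr20", 300_000, "-"),  # different chromosomes
    MatePair("chr17", 200_500, "+", "chr20", 300_400, "-"),  # second supporting pair
    MatePair("chr3", 50_000, "+", "chr3", 80_000, "-"),      # size discrepancy, one pair only
    MatePair("chr5", 10_000, "+", "chr5", 15_000, "+"),      # orientation discrepancy
    MatePair("chr5", 10_200, "+", "chr5", 15_300, "+"),      # second supporting pair
]
calls = call_rearrangements(pairs)
```

Fixed-size windows are the simplest way to group nearby pairs; a real caller would more likely cluster pairs by overlapping coordinates so that events straddling a window boundary are not split.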


somatic  mutation  validation  between  fosmid  diTag  and  Illumina  mate  pair  data  sets  in  the  MCF7  and  HCC1954 
genomes. 


Figure  11:  Fosmid  diTag  workflow  schema  illustrating  the  formation  and  amplification  of  the 
concatenated  26bp  end-tags  from  the  fosmid  insert  termini. 


Task  4:  Study  the  biological  significance  of  the  breakpoints  using  in  vitro  models. 
a. Determine the downstream consequence based on the position of the aberrant joint.

Based on protein sequence analysis and protein translation programs, I was able to predict the
fusion proteins and speculate on the consequences of the ARFGEF2/SULF2 and RAD51C/ATXN7 fusions.
In the ARFGEF2/SULF2 fusion, the Sulfatase 2 (SULF2) protein loses the targeting peptide that
directs it for secretion, while the added ARFGEF2 sequence does not contribute any functional
domain. This might mean that the fusion creates a non-functional Sulfatase 2.

Protein translation programs showed that the fusion between RAD51C and ATXN7 introduces a
frameshift, which creates a stop codon early in the ATXN7 sequence. This most likely results in
the loss of a critical C-terminal domain of RAD51C, without the addition of any significant
sequence of Ataxin 7. Preliminary data supporting this truncation are shown under Task 1.
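
The frameshift reasoning can be illustrated with a minimal translation sketch. The sequences below are invented toy fragments, not RAD51C or ATXN7 sequence, and the helper names are hypothetical. The sketch assumes that the 3' fragment begins at the first base of a codon in its native reading frame, so the junction stays in frame only when the 5' fragment length is a multiple of three.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate from the first base; '*' marks a stop codon and ends translation."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3].upper()]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)


def fusion_outcome(five_prime: str, three_prime: str) -> dict:
    """Join two coding fragments and report the frame and the predicted protein."""
    protein = translate(five_prime + three_prime)
    return {
        # Assumes three_prime starts at codon position 1 of its native frame.
        "in_frame": len(five_prime) % 3 == 0,
        "protein": protein,
        "stop_reached": protein.endswith("*"),
    }


# Toy fragments (hypothetical): the 3' partner reads V-K-L-D in its own frame.
three_prime = "GTTAAGCTGGAT"
in_frame_fusion = fusion_outcome("ATGGCTGAA", three_prime)    # 9 nt upstream
shifted_fusion = fusion_outcome("ATGGCTGAAC", three_prime)    # 10 nt upstream
```

In the shifted toy fusion, the extra upstream base moves the reading frame so that the partner sequence is read as `CGT TAA`, which places a stop codon after a single out-of-frame residue. This is the same kind of early termination predicted for the RAD51C/ATXN7 join.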

b.  Recreate  join  with  cloning  techniques. 


Figure  13:  Test  expression  of  fusion  constructs.  293  cells  were  transfected  with  either  control  plasmid  (not 
shown),  ARFGEF2/SULF2,  or  DEPDC1B/ELOVL7  constructs.  Cells  were  fixed,  stained  with  an  anti-V5 
antibody,  and  imaged  by  fluorescence  microscopy. 


I have cloned the ARFGEF2/SULF2, DEPDC1B/ELOVL7 and RAD51C/ATXN7 fusions into mammalian
expression vectors by performing RT-PCR on MCF-7 cells. Expression of the ARFGEF2/SULF2 and
DEPDC1B/ELOVL7 vectors was tested by transfecting 293 cells. The cells were then stained with an
anti-V5 antibody and analyzed by fluorescence microscopy (Figure 13).

c.  Perform  targeted  experiments  to 
determine  functional  consequence  of  the 
aberrant  join. 


Figure 14: A) Cells treated with SULF2 siRNA show enhanced proliferation compared with cells
treated with control siRNA. B) Cells treated with SULF2 siRNA show enhanced survival compared to
cells treated with control siRNA. C) Treatment of MCF-7B and MDA MB231 cells with siRNA for SULF2
increases their anchorage-independent growth capabilities. Time axes are in days; conditions are
control siRNA and SULF2 siRNA.


To gain insight into the function of the ARFGEF2/SULF2 fusion, SULF2 mRNA was knocked down using
siRNA specifically targeting SULF2 in MCF-7, MDA MB231 and MCF10A cells. In a proliferation assay,
all three cell lines treated with SULF2 siRNA exhibited an advantage over cells treated with
control siRNA. Cells treated with SULF2 siRNA also showed enhanced survival: cells with reduced
SULF2 die less and recover faster in serum-free conditions than control cells. Knock-down of SULF2
mRNA also gave a clear advantage in anchorage-independent growth capability. This indicates that
knocking down SULF2 enhances the tumorigenic properties of multiple breast cell lines, and that
SULF2 might act as a tumor suppressor in breast cancer development. The presence of the
ARFGEF2/SULF2 fusion might mean a loss of function of the wildtype tumor suppressor Sulfatase 2
and enhance the tumorigenicity of MCF-7 cells.

Key  research  Accomplishments 

•  Discovered 157 joins in the MCF-7 cell line, of which only a few have been previously described.

•  Discovered 10 gene fusions, of which 4 express a chimeric mRNA (3 new: ARFGEF2/SULF2,
DEPDC1B/ELOVL7 and RAD51C/ATXN7; 1 previously described: BCAS4/BCAS3).

•  31 of the 157 joins are present in all 7 MCF-7 sub-lines tested. This allows us to narrow down
on 'true' breakpoints present in the ancestor of the MCF-7 cell lines.

•  Confirmed the RAD51C/ATXN7 fusion by FISH in the MCF-7 cell line.

•  Cloned 3 fusion transcripts (ARFGEF2/SULF2, DEPDC1B/ELOVL7, RAD51C/ATXN7) into mammalian
expression vectors by amplification of the fusion transcript by RT-PCR.

•  Discovered the RAD51C/ATXN7 fusion transcript in two other breast cancer cell lines (T47D and
MDA MB361).

•  Discovered a short form of Rad51C protein in MCF-7 and MDA MB361.

•  Sulfatase 2 appears to act as a tumor suppressor in breast cancer cell lines, and might be
dysfunctional after generation of the ARFGEF2/SULF2 fusion.

Training  accomplishments 

•  Presented  twice  at  the  Research  and  Development  workshop  of  the  Breast  Center. 

•  Attended the Breast Center/Cancer Center retreat and gave an oral presentation of data
(November 2008)

•  Attended  and  presented  a  poster  at  the  LINK  meeting  (February  2009) 

•  Attended  and  presented  a  poster  at  the  Breast  Center/Cancer  Center  retreat  (September 
2009) 

•  Attended  weekly  the  Research  and  Development  workshop  of  the  Breast  Center 

•  Attended  bi-monthly  the  Journal  Club  of  the  Breast  Center 

•  Contributed  to  the  generation  of  data,  writing  and  editing  of  the  manuscript  published  in 
Genome  Research 

•  Attended  the  course  ‘Translational  Breast  Cancer’ 

•  Supervised  several  graduate  and  summer  students. 


Reportable  outcomes 

•  Hampton OA, den Hollander P, Miller CA, Delgado DA, Li J, Coarfa C, Harris RA,
Richards S, Scherer SE, Muzny DM, Gibbs RA, Lee AV, Milosavljevic A: A sequence-level
map of chromosomal breakpoints in the MCF-7 breast cancer cell line yields insights into
the evolution of a cancer genome. Genome Research 19(2):167-177, 2009.

•  Abstract submission for the Breast Center Retreat, November 2008, entitled: Discovery of
functional genomic breakpoints in breast cancer.

•  Abstract submission for the Breast Center Retreat, September 2009, entitled: Evolution of
genomic diversity in the breast cancer cell line MCF-7.

•  Abstract submission to the San Antonio Breast Cancer Symposium, December 2009, entitled:
Evolution of genomic diversity in the breast cancer cell line MCF-7.

•  Hampton OA, Miller CA, Koriabine M, Li J, den Hollander P, Carbone L, Nefedov M,
ten Hallers BFH, Lee AV, de Jong PJ, Milosavljevic A: Long-range massively parallel mate pair
sequencing detects distinct mutations and similar patterns of structural mutability in two
breast cancer cell lines. Cancer Genetics 204:447-457, 2011.

•  Evolution of genomic diversity in the breast cancer cell line MCF-7. In preparation.

Personnel  receiving  pay  from  the  research  effort 
•  Petra den Hollander, PhD

Conclusion

In  contrast  to  leukemias  and  lymphomas,  carcinomas  contain  more  complex  chromosomal 
rearrangements,  only  partially  detectable  using  classic  cytogenetic  methods.  Thus,  our 
knowledge  of  chromosomal  rearrangements  in  solid  tumors  is  very  limited,  and  “gene  fusions” 
defining  a  specific  type  of  solid  tumor  have  not  yet  been  characterized.  This  lack  of  knowledge 
has  supported  the  paradigm  that  chromosomal  rearrangements  leading  to  gene  fusions  are  almost 
exclusively  seen  in  hematologic  malignancies  and  are  extremely  rare  (maybe  <1%)  in  solid 
tumors. 

Here we set out to discover the chromosomal rearrangements that are important in breast cancer.
The data presented here show that there are indeed breakpoints that have functional significance
in breast cancer cell lines. I even discovered a fusion that is present in two other breast cancer
cell lines besides MCF-7. The data on the ARFGEF2/SULF2 and RAD51C/ATXN7 fusions indicate that we
may have discovered a novel strategy by which tumor cells silence important tumor suppressors.

To gain insight into the heterogeneity of genomic breakpoints, I studied the validated breakpoints
in seven MCF-7 sub-lines. Among the breakpoints shared by all sub-lines there is an enrichment for
breakpoints containing genes (77.4% vs 50.3% of all breakpoints) and for fusion-containing
breakpoints (16.1% vs 6.4%). When studying cell lines originating from a single cell, we
discovered that there is very little genetic variability between them. A large effort has gone
into the development of next-generation sequencing techniques for the discovery of genomic breaks
and fusions in the breast cancer genome. We developed new techniques and validated them with
standard PCR. The validation rate is very promising, and these new techniques will aid us in the
discovery of genomic breakpoints in breast cancer.




Downloaded  from  genome.cshlp.org  on  October  12,  2009  -  Published  by  Cold  Spring  Harbor  Laboratory  Press 


A  sequence-level  map  of  chromosomal  breakpoints  in  the  MCF-7 
breast  cancer  cell  line  yields  insights  into  the  evolution  of  a  cancer 
genome 

Oliver  A.  Hampton,  Petra  Den  Hollander,  Christopher  A.  Miller,  et  al. 

Genome  Res.  2009  19: 167-177  originally  published  online  December  3,  2008 
Access the most recent version at doi:10.1101/gr.080259.108




Copyright  ©  2009  by  Cold  Spring  Harbor  Laboratory  Press 




A  sequence-level  map  of  chromosomal  breakpoints 
in  the  MCF-7  breast  cancer  cell  line  yields  insights 
into  the  evolution  of  a  cancer  genome 

Oliver  A.  Hampton,1,3,5  Petra  Den  Hollander,4,5  Christopher  A.  Miller,1,3 
David  A.  Delgado,4,5  Jian  Li,1,3  Cristian  Coarfa,1,2  Ronald  A.  Harris,1,2 
Stephen  Richards,2  Steven  E.  Scherer,2  Donna  M.  Muzny,2  Richard  A.  Gibbs,2,3 
Adrian  V.  Lee,4,5,6  and  Aleksandar  Milosavljevic1,2,3,5,7 

1 Bioinformatics Research Laboratory, Baylor College of Medicine, Houston, Texas 77030, USA; 2 Human Genome Sequencing Center,
Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA; 3 Program in Structural and
Computational Biology and Molecular Biophysics, Baylor College of Medicine, Houston, Texas 77030, USA; 4 Breast Center, Baylor
College of Medicine, Houston, Texas 77030, USA; 5 Dan L. Duncan Cancer Center, Baylor College of Medicine, Houston, Texas 77030,
USA; 6 Department of Medicine, Baylor College of Medicine, Houston, Texas 77030, USA


By applying a method that combines end-sequence profiling and massively parallel sequencing, we obtained a
sequence-level map of chromosomal aberrations in the genome of the MCF-7 breast cancer cell line. A total of 157
distinct somatic breakpoints of two distinct types, dispersed and clustered, were identified. A total of 89
breakpoints are evenly dispersed across the genome. A majority of dispersed breakpoints are in regions of low copy
repeats (LCRs), indicating a possible role for LCRs in chromosome breakage. The remaining 68 breakpoints form four
distinct clusters of closely spaced breakpoints that coincide with the four highly amplified regions in MCF-7
detected by array CGH, located in the 1p13.1-p21.1, 3p14.1-p14.2, 17q22-q24.3, and 20q12-q13.33 chromosomal
cytobands. The clustered breakpoints are not significantly associated with LCRs. Sequences flanking most (95%)
breakpoint junctions are consistent with double-stranded DNA break repair by nonhomologous end-joining or template
switching. A total of 79 known or predicted genes are involved in rearrangement events, including 10 fusions of
coding exons from different genes and 77 other rearrangements. Four fusions result in novel expressed chimeric mRNA
transcripts. One of the four expressed fusion products (RAD51C-ATXN7) and one gene truncation (BRIP1 or BACH1)
involve genes coding for members of protein complexes responsible for homology-driven repair of double-stranded DNA
breaks. Another one of the four expressed fusion products (ARFGEF2-SULF2) involves SULF2, a regulator of cell growth
and angiogenesis. We show that knock-down of SULF2 in cell lines causes tumorigenic phenotypes, including increased
proliferation, enhanced survival, and increased anchorage-independent growth.

[Supplemental material is available online at www.genome.org and through the Breast Cancer project page at
www.genboree.org. All MCF-7 BAC clones are available from Amplicon Express under name HTA and plate/row/column
names as indicated. The sequence data from this study have been submitted to the NCBI Trace and Short Read
Archives (http://www.ncbi.nlm.nih.gov) under accession nos. 2172854909-2172901416 and 2172904852-2172911164, and
SRR006762-SRR006767, respectively.]


Many cancer genomes are characterized by mutability including microsatellite instability (MIN) and
chromosomal instability (CIN) (Lengauer et al. 1998). It is now generally anticipated that sequencing of
cancer genomes using massively parallel sequencing technologies (Korbel et al. 2007; Campbell et al. 2008)
will provide insights into structural mutability. Recent sequencing of four cancer amplicons (Bignell et al.
2007) derived from the HCC1954 breast cancer cell line and two lung cancer cell lines provided evidence for
homologous and nonhomologous repair of double-strand DNA breaks induced by the breakage-fusion-bridge (BFB)
mechanism.

Gene fusions and truncations that result from chromosomal rearrangements provide insight into the molecular
mechanisms of cancer progression. Recurrent rearrangements of specific genes indicate increased mutability or
positive selection (or a combination of both) in the evolution of tumor genomes. Recurrent fusions,
translocations, and other aberrant joins are used as highly informative diagnostic and prognostic markers and
drug targets in leukemias, lymphomas, and sarcomas. A total of 337 genes involved in fusions in cancer genomes
have been recently surveyed (Mitelman et al. 2007). Four gene fusions have previously been reported in breast
carcinomas (ETV6-NTRK3, ODZ4-NRG1, TBL1XR1-RGS17, BCAS3-BCAS4) (Mitelman et al. 2007; Ruan et al. 2007).

Breast cancer and carcinomas in general have proven less tractable to fusion discovery due to the typically
higher degree of rearrangement. However, a prognostically significant rearrangement was recently discovered in
the majority of prostate cancers (Tomlins et al. 2005). Of note, that rearrangement was initially identified
not by analyzing DNA sequence or structure, but via the analysis of outlier gene expression, followed by a
targeted locus-specific search for a fusion in genomic DNA. Here we demonstrate a method to detect gene fusions
directly by the analysis of genomic DNA, even in highly rearranged breast cancer.

Genome Research 19:167-177 © 2009 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/09; www.genome.org

MCF-7 is the most widely used cell line model for estrogen-positive breast cancer. The cell line was derived
from a pleural effusion taken from a patient with metastatic breast carcinoma (Soule et al. 1973). Evidence of
CIN in MCF-7 comes from apparent aneuploidy and significant genomic divergence in several sublines (Jones et al.
2000; Nugoli et al. 2003). Chromosomal aberrations in MCF-7 have previously been studied by spectral karyotyping
(Kytola et al. 2000; Rummukainen et al. 2001), comparative genomic hybridization (CGH) (Kytola et al. 2000;
Rummukainen et al. 2001), array CGH (Neve et al. 2006; Shadeo and Lam 2006; Jonsson et al. 2007), single
nucleotide polymorphism arrays (Huang et al. 2004), and gene expression arrays (Neve et al. 2006).

More recently, bacterial artificial chromosome (BAC)-based end sequence profiling (ESP) (Volik et al. 2003,
2006; Raphael et al. 2008) has been applied to study genomic rearrangements in cancer genomes. Volik and
colleagues sequenced a total of 19,831 BAC ends from the Amplicon Express MCF-7 BAC library, ~1x clone
coverage of the human genome, to identify 582 BACs containing rearrangements.

As a starting point for our analysis, we constructed BAC pools from a nonredundant subset (n = 552) of
rearranged BACs identified by Volik et al. (2003, 2006). To map chromosomal aberrations in the genome of the
MCF-7 breast cancer cell line at sequence-level resolution, we developed a method that combines end-sequence
profiling and massively parallel sequencing. By analyzing sequences of the chromosomal breakpoints in the BAC
pools, we gained insights into the mechanisms of chromosomal instability and repair. Specific gene fusions and
truncations that have emerged during the pathological evolution of this cancer genome point to the molecular
mechanisms of the disease. Additional products of our research are benchmarking reagents for the development of
a new generation of methods for detecting structural genome variation, including well-characterized BAC pools
and validated breakpoints in the MCF-7 genome.

Results 

At  least  157  breakpoints  were  induced  by  somatic 
rearrangements  in  MCF-7 

Aberrant breakpoint-induced joins were identified by combining "bridging" and "outlining" steps, as
illustrated in Figure 1A. The bridging step utilizes end-sequence information from fosmid-sized clone inserts
to connect chromosomal loci brought together at aberrant rearrangement-induced joins in the cancer genome.
End-sequences of breakpoint-spanning fosmids were recognized as those that do not map onto the reference genome
in a manner consistent with the clone insert size or end-sequence orientation. The outlining step involves a
precise localization of breakpoint sites by mapping short tags generated by the 454 Life Sciences (Roche)
pyrosequencing machine onto the reference genome.
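The bridging criterion described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the `MappedEnd` record, the `is_breakpoint_spanning` function, and the 30-45 kb insert-size window are assumptions made for the example (the text states only that the inserts are fosmid-sized).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappedEnd:
    """One mapped clone end-sequence (hypothetical record layout)."""
    chrom: str
    pos: int      # leftmost reference coordinate of the read
    strand: str   # "+" or "-"


def is_breakpoint_spanning(a: MappedEnd, b: MappedEnd,
                           min_insert: int = 30_000,
                           max_insert: int = 45_000) -> bool:
    """Return True if a clone's two ends map inconsistently with an intact insert.

    A concordant pair lies on one chromosome, has the forward-strand read
    upstream of the reverse-strand read, and spans a distance inside the
    expected insert-size window. Anything else suggests an aberrant join.
    The default window is an assumed fosmid size range, not a value from the text.
    """
    if a.chrom != b.chrom:
        return True                      # ends map to different chromosomes
    left, right = sorted((a, b), key=lambda e: e.pos)
    if (left.strand, right.strand) != ("+", "-"):
        return True                      # unexpected relative orientation
    span = right.pos - left.pos
    return not (min_insert <= span <= max_insert)
```

A pair such as `MappedEnd("chr20", 1_000_000, "+")` and `MappedEnd("chr17", 1_038_000, "-")` would be flagged as breakpoint-spanning, whereas the same two coordinates on a single chromosome would not.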

As illustrated in Figure 1B, three pools, each containing 192 BACs containing putative rearrangements, were
constructed for the purpose of massively parallel sequencing using the 454 GS sequencing machine.
Approximately 300,000 short (~100-bp) reads were sequenced from each pool, providing ~1x sequence coverage
for the purpose of outlining. Six 96-BAC pools were formed from the same set of BACs for the purpose of fosmid
library preparation, end-sequencing, and bridging. Approximately 8000 to 10,000 fosmid inserts from each of the
six pools were end-sequenced, providing 24x clone coverage and ~1x sequence coverage for the purpose of
bridging.
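The coverage figures quoted above can be checked with simple arithmetic. The sketch below assumes an average BAC insert of about 150 kb, an average fosmid insert of about 40 kb, and Sanger end reads of about 700 bp; none of these sizes is given in the text, so the results are order-of-magnitude checks only.

```python
def sequence_coverage(n_reads: int, read_len_bp: int, target_bp: int) -> float:
    """Total sequenced bases divided by the size of the target."""
    return n_reads * read_len_bp / target_bp


def clone_coverage(n_clones: int, insert_bp: int, target_bp: int) -> float:
    """Total bases spanned by clone inserts divided by the size of the target."""
    return n_clones * insert_bp / target_bp


# Assumed average insert and read sizes (not stated in the text).
BAC_INSERT_BP = 150_000
FOSMID_INSERT_BP = 40_000
SANGER_READ_BP = 700

# Outlining: ~300,000 reads of ~100 bp from each 192-BAC pool.
outline_target = 192 * BAC_INSERT_BP
outlining_cov = sequence_coverage(300_000, 100, outline_target)

# Bridging: ~9000 fosmids (midpoint of 8000-10,000) from each 96-BAC pool,
# each fosmid contributing two end reads.
bridge_target = 96 * BAC_INSERT_BP
bridging_clone_cov = clone_coverage(9_000, FOSMID_INSERT_BP, bridge_target)
bridging_seq_cov = sequence_coverage(2 * 9_000, SANGER_READ_BP, bridge_target)

print(f"outlining sequence coverage: {outlining_cov:.2f}x")
print(f"bridging clone coverage:     {bridging_clone_cov:.1f}x")
print(f"bridging sequence coverage:  {bridging_seq_cov:.2f}x")
```

Under these assumed sizes the three values come out close to 1x, 25x, and 0.9x, which is in line with the ~1x and 24x figures reported in the text.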

Upon sequencing, the fosmid end-reads and the 454 reads together with the BAC end-sequences produced by Volik
et al. (2003, 2006) were mapped onto the reference human genome. Independent aberrant mapping of two fosmids
across a specific putative breakpoint was considered to constitute sufficient evidence to declare the
breakpoint. BAC or fosmid ends that map onto different chromosomes are interpreted as interchromosomal
breakpoints. The outlined regions were bridged using end-sequences from BACs and fosmids. The combination of
outlining and bridging enabled identification of breakpoint locations down to a PCR-able distance. As indicated
in Figure 1C, out of the total of 410 detected breakpoints, 157 could be confirmed by PCR across breakpoint
joins as likely distinct somatic mutations. As indicated by the bars in the middle of Figure 1C, the remaining
breakpoints failed the confirmation process for a number of different reasons, as we explain next.

A total of 47 breakpoints could not be unambiguously resolved down to a PCR-able distance using the outlining
method. PCR primers were designed for the remaining breakpoints using a semi-automated primer design pipeline.
When applied to pooled BACs, PCR primers failed to generate amplicons in the expected size range for 23
predicted breakpoint joins. Further confirmation included amplification of a pool of genomic DNA from six MCF-7
cell lines (B, BK, C, D, L, and Neo). DNA isolated from MCF-10A and normal human female DNA (Novagen) were used
as negative controls. A total of 123 PCR primer pairs that produced amplicons from the BAC pool did not produce
amplicons from the genomic DNA derived from cell pools. A majority of these breakpoint sites contained HindIII
restriction sites. Since the BAC library was prepared using HindIII partial digestion of genomic DNA, those
breakpoints were most likely created by fusion of digestion products in the course of BAC library preparation.
Other sources of this discrepancy may include a number of cell line-specific aberrations generated over a
number of passages that preceded preparation of the BAC library.

To identify structural polymorphic variants present in the germline of the MCF-7 donor, PCR amplification of
breakpoint joins was performed on a pool of 90 Caucasian HapMap genomes (International HapMap Consortium 2005).
Additionally, a search for occurrences of the apparently somatic joins was performed in publicly available
genomic sequences using the Pash program (Kalafus et al. 2004). A total of 40 apparently aberrant joins were
present in the HapMap samples, as indicated by the presence of a PCR product, and thus correspond to structural
alleles different from the structural alleles represented in the reference genome assembly. Finally, some
breakpoints were identified to occur in more than one BAC, and the count was reduced by 20 to eliminate multiple
counting, resulting in a total of 157 unique confirmed somatic breakpoint joins in the MCF-7 genome. Of the 157
MCF-7 somatic breast cancer breakpoints, 74 (47%) formed interchromosomal and 83 (53%) intrachromosomal joins,
as illustrated in Figure 2, A and B.
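The filtering steps described above account exactly for the difference between the 410 detected and the 157 confirmed breakpoints. The short tally below uses only counts stated in the text; the dictionary labels are descriptive paraphrases rather than terms from the study.

```python
# Counts taken from the text; the keys are descriptive labels only.
detected = 410
excluded = {
    "not resolved to a PCR-able distance": 47,
    "no amplicon of expected size from pooled BACs": 23,
    "amplicon from BAC pool but not from MCF-7 genomic DNA": 123,
    "present in HapMap pool (germline structural allele)": 40,
    "duplicate observations in more than one BAC": 20,
}

confirmed_somatic = detected - sum(excluded.values())
print(confirmed_somatic)  # 157

inter, intra = 74, 83
assert inter + intra == confirmed_somatic
print(f"interchromosomal: {inter / confirmed_somatic:.0%}")  # 47%
print(f"intrachromosomal: {intra / confirmed_somatic:.0%}")  # 53%
```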

A  majority  of  the  somatic  breakpoints  could  be  assigned 
to  specific  BACs 

If a chromosomal segment outlined by 454 reads connected a BAC end-sequence and a breakpoint-spanning fosmid
end-sequence, the breakpoint could be associated with the BAC. Out of 552 pooled BACs, at least one breakpoint
could be assigned to 316 (57%) of them. The remaining BACs fall into the following two groups: First, in 129
(23%) cases, breakpoint assignment was inconclusive due to ambiguous mapping of reads onto the reference
genome, mostly due to repetitive DNA regions, apparent overlaps between BACs, and other causes; second, in 107
(20%) cases, a single outlining block connected BAC ends, thus indicating lack of any rearrangement, contrary
to previous reports (Volik et al. 2003, 2006).

Figure 1. (A) An illustration of the principle of the method. Breakpoints within a BAC containing segments from
chromosomes 20, 3, and 17 are detected using a combination of "bridging" and "outlining" steps. The bridging
step maps fosmid end-sequences onto the reference genome. The outlining step maps short tags (labeled
"PyroSeqs") generated using 454 technology from the BAC (in practice a pool of BACs) onto the reference genome.
The results of bridging and outlining jointly allow precise mapping of breakpoints and reconstruction of
rearranged BACs. (B) Organization of the mapping experiment. The nonredundant collection of 552
rearrangement-containing BACs, 17 normal BAC negative controls, and seven positive controls was arrayed in six
96-well plates and pooled as indicated. Three 454 sequencing reactions (involving BACs pooled from plate pairs)
produced tags for the purpose of outlining. Six fosmid libraries (one from each 96-well plate pool of BACs)
were constructed for Sanger-based sequencing of fosmid ends and bridging. (C) Bar charts detailing the
classification of detected MCF-7 breakpoints.

To examine the source of the disagreement with the previous reports, the 107 disagreements were examined in
detail. Most of the disagreements could be explained either by the differences between reference genome
assemblies used in the previous and current studies or by mismapping of BAC-end sequence reads or by a
combination of the two factors. Assemblies used in the previous studies were NCBI Build 30 of June 2002 (Volik
et al. 2003) and NCBI Build 34 of July 2003 (Volik et al. 2006), while our study employed NCBI Build 36 of
March 2006. The newer assembly is likely to be more correct and complete, but some of the disagreements may
also be explained by the presence of different structural alleles at sites of structural polymorphisms. The
disagreements tended to occur in regions containing low copy repeats (LCRs). For example, Volik et al. (2003)
identified MCF-7 BAC 9110 as bridging apparent translocation t(11;11)(p11.12;q14.3) and apparently confirmed
the rearrangement by fluorescent in situ hybridization (FISH). Examination of Build 36 reveals copies of an LCR
at both 11p11.12 and 11q14.3. The LCR was absent from Builds 30 and 34, thus explaining the aberrant BAC-end
sequence mapping and even the erroneous "confirmation" by FISH.

Examination  of  breakpoint  sequences  reveals  signatures 
of  DSB  repair 

To examine breakpoints at the sequence level, all the 157 breakpoint-spanning amplicons were used as
substrates for sequencing from both ends. Most amplicons were of small enough size (less than 1 kb on average),
allowing the Sanger read from at least one of the ends to reach the breakpoint. Difficulty of sequencing across
breakpoints has been documented (Lee et al. 2007; Liu and Carson 2007), especially in repeat-rich regions. To
ameliorate the problem, we sequenced DNA from specific BAC pools and employed nested sequencing primers in
cases of first-pass sequencing failures. Breakpoint-straddling sequence could be obtained from 86 (55%)
amplicons and could not be obtained for the remaining 71 (45%). Many of the failures were due to inability to
design unique primers for sequencing across breakpoints that fall within repeat-rich regions.

Examination of 86 breakpoints that could be resolved to the base pair level (summarized in the chart in the
middle of Fig. 2B) revealed 14 flush joins without evidence of microhomology or intervening sequence, 29 joins
with intervening inserts of unknown genomic origin averaging over 100 bp in length, and 43 joins where the
joined segments exhibit homology. The extent of homology was in most (88%) cases restricted to <7 bp,
consistent with microhomology observed in double-stranded breaks repaired by nonhomologous end-joining (NHEJ)
or template switching (Sonoda et al. 2006). Due to the absence of straddling sequence, the remaining 71
breakpoints could only be analyzed at the ~1-kbp level of resolution.

Figure 2. (A) Circular visualization of the MCF-7 genome obtained using Circos software. Chromosomes are
individually colored with centromeres in white and LCR regions in black. MCF-7 BAC array comparative genome
hybridization data (Jonsson et al. 2007) are plotted with gains in green and losses in red using log2 ratio.
The inner chromosome annotations depict 157 somatic MCF-7 breast tumor chromosomal rearrangements associated
with LCRs (black) and breakpoints not associated with LCRs (green). Chromosomal rearrangements are depicted on
each side of the MCF-7 breakpoints; intrachromosomal rearrangements (blue) are located outside and
interchromosomal rearrangements (red) are located in the center of the circle. (B) Bar charts indicating
classification of somatic breakpoints in MCF-7.

Out of the 86 somatic breakpoints isolated to base pair resolution, only four (5%) exhibited sequence patterns
(sequence identity and equal crossover between two homologous loci) consistent with nonallelic homologous
recombination (NAHR) (chart on the right of Fig. 2B). The dominant mechanism responsible for the repair of
double-strand breaks in MCF-7 therefore appears to be NHEJ or template switching.
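The junction classes used above (flush joins, joins with intervening inserts, and joins with short or long homology) can be illustrated with a small Python sketch. It assumes the two parental reference sequences and the join coordinates are already known; the function names are hypothetical, and only the 7-bp threshold is taken from the text. A real analysis would start from alignments of the junction read to the reference genome.

```python
def junction_homology(left_ref: str, left_end: int,
                      right_ref: str, right_start: int) -> int:
    """Count bases at a join that are shared by both parental sequences.

    The rearranged sequence is left_ref[:left_end] + right_ref[right_start:].
    Bases are counted outward from the join in both directions for as long
    as the two reference sequences agree at corresponding offsets.
    """
    shared = 0
    # Extend to the right of the join.
    i, j = left_end, right_start
    while i < len(left_ref) and j < len(right_ref) and left_ref[i] == right_ref[j]:
        shared += 1
        i += 1
        j += 1
    # Extend to the left of the join.
    i, j = left_end - 1, right_start - 1
    while i >= 0 and j >= 0 and left_ref[i] == right_ref[j]:
        shared += 1
        i -= 1
        j -= 1
    return shared


def classify_join(homology_bp: int, insert_bp: int = 0) -> str:
    """Assign a join to one of the classes described in the text."""
    if insert_bp > 0:
        return "intervening insert"
    if homology_bp == 0:
        return "flush"
    return "microhomology (<7 bp)" if homology_bp < 7 else "longer homology"


# Toy example: both parental sequences carry "CGT" at the join.
left = "AAAACGTTTTT"
right = "GGGGCGTCCCC"
n = junction_homology(left, 4, right, 4)
print(n, classify_join(n))  # 3 microhomology (<7 bp)
```

Joins in the "longer homology" class are the candidates for NAHR; deciding whether they show the sequence identity and equal crossover mentioned above would require inspecting the full alignment.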

Two distinct types of breakpoints exist in MCF-7: clustered
and LCR-associated

As evident from Figure 2, the breakpoints in MCF-7 are not evenly distributed across the genome. A number of
clusters of closely spaced breakpoints are evident. To formally delineate the clustered breakpoints from the
remainder, clusters of eight or more breakpoints that are less than 1.1 Mbp apart were identified. Four such
clusters emerged in the following locations: 1p13.1-p21.1, 3p14.1-p14.2, 17q22-q24.3, and 20q12-q13.33. These
four rearrangement clusters, illustrated in Figure 3A, contain 43% of all MCF-7 somatic breakpoints, while
representing only 1.5% of the normal reference genome.
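The cluster definition above can be sketched as a single pass over sorted breakpoint positions. The code below is one possible reading of the criterion, assuming that "less than 1.1 Mbp apart" refers to the gap between neighbouring breakpoints on the same chromosome; the function name and the input layout are hypothetical.

```python
from collections import defaultdict


def find_clusters(breakpoints, max_gap_bp=1_100_000, min_size=8):
    """Group breakpoints into clusters of closely spaced sites.

    breakpoints: iterable of (chromosome, position) pairs.
    Consecutive sites on one chromosome stay in the same group while the gap
    between neighbours is below max_gap_bp; groups with fewer than min_size
    members are discarded. Returns a list of (chrom, start, end, count).
    """
    by_chrom = defaultdict(list)
    for chrom, pos in breakpoints:
        by_chrom[chrom].append(pos)

    clusters = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        group = [positions[0]]
        for pos in positions[1:]:
            if pos - group[-1] < max_gap_bp:
                group.append(pos)
            else:
                if len(group) >= min_size:
                    clusters.append((chrom, group[0], group[-1], len(group)))
                group = [pos]
        if len(group) >= min_size:
            clusters.append((chrom, group[0], group[-1], len(group)))
    return clusters
```

Applied to eight toy breakpoints spaced 500 kb apart on one chromosome plus a few isolated sites elsewhere, the function returns a single cluster covering the eight closely spaced sites.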

The remaining nonclustered or dispersed breakpoints are highly associated with LCRs, showing a 5.2-fold
enrichment for the presence of LCRs at the breakpoint site (P-value = 2.9 × 10^-22; see Fig. 3B). This is in
contrast to the clustered breakpoints that do not exhibit enrichment for LCRs, with only five out of 68
clustered breakpoints being LCR-associated, well within the number expected by chance. Moreover, as illustrated
in Figure 3C, the four clustered breakpoint locations exactly coincide with high copy number gain regions
("firestorms," the term proposed by Hicks et al. [2006]) in the MCF-7 genome described by Jonsson et al. (2007)
and contain prognostic gene markers for breast cancer.
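One simple way to ask whether a count of LCR-associated breakpoints exceeds chance is a one-sided binomial test against a background rate. The sketch below is illustrative only: the text does not state which test produced the reported P-value, and the 5% background fraction is an assumed placeholder rather than a figure from the study.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def fold_enrichment(k: int, n: int, p: float) -> float:
    """Observed count divided by the count expected under background rate p."""
    return k / (n * p)


# Assumed background: chance that a random site is LCR-associated (placeholder).
BACKGROUND_LCR_FRACTION = 0.05

# Clustered breakpoints: 5 of 68 are LCR-associated (counts from the text).
fold = fold_enrichment(5, 68, BACKGROUND_LCR_FRACTION)
p_value = binomial_upper_tail(5, 68, BACKGROUND_LCR_FRACTION)
print(f"fold = {fold:.2f}, P(X >= 5) = {p_value:.2f}")
```

With this assumed background, five of 68 gives a tail probability far above 0.05, consistent with the statement that the clustered breakpoints show no LCR enrichment. The same two functions could be applied to the dispersed breakpoints once the count of LCR-associated sites and the true background rate are known.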

To further examine possible differences between the clustered breakpoints and the dispersed ones, we
identified regions that show recurrent copy number amplification in cancer in previous studies involving 145
breast tumors and 56 breast cancer cell lines (Chin et al. 2006; Neve et al. 2006; Shadeo and Lam 2006; Jonsson
et al. 2007). As illustrated in Supplemental Figure 5, almost three-fourths of breakpoints occurring in the
four clusters are highly recurrently amplified (high recurrence is declared if at least 20% of the surveyed
samples show amplification), a greater than twofold enrichment over other (dispersed) breakpoints.
Additionally, the mean number of amplifications at each breakpoint location is significantly higher among
clustered vs. dispersed breakpoints. These data suggest that genomic instability in these cluster regions is
not specific to MCF-7.

Novel chimeric transcripts could be predicted based
on fusions of genomic DNA

Among the breakpoint fusions that involved genes, we first focused on those that occurred within introns and
are predicted to lead to chimeric transcripts. We discovered 10 gene fusions (Table 1) where fusion breakpoints
reside in intronic regions of the genes involved, implying in-frame translation of the original amino acid
sequences.

To determine if the predicted chimeric mRNA transcript was created by these genomic fusions, we performed
gene-specific reverse transcriptase reactions and a fusion-specific PCR on RNA extracted from MCF-7, MCF-10A,
and normal breast tissue (the latter two serving as negative controls). Since the primers were designed to
amplify the fusion product specifically, a band was only generated if a fusion product was present (for primer
sequences, see Supplemental Table 4). Out of 10 fusions, four showed a fusion mRNA transcript by RT-PCR
(Fig. 4).

To identify if other sources reported the same fusion transcripts in MCF-7, other cell lines, or primary
tumors, we queried 70 MCF-7 and HCT116 (colon cancer) paired-end ditag fusion transcript sets reported by Ruan
et al. (2007) and 237 fusion transcripts from the Cancer Genome Anatomy Project Recurrent Chromosome
Aberrations in Cancer database reported by Hahn et al. (2004). Of the 10 MCF-7 gene fusions identified by our
bridging and outlining method, the BCAS3-BCAS4 fusion was found to have been previously characterized by Ruan
et al. (2007). Interestingly, the BCAS3-BCAS4 fusion is recurrently present in both the MCF-7 breast cancer and
HCT116 colon cancer cell lines.

Some  of  the  fusions  and  truncations  may  suppress  function 
of  normal  gene  product 

Most fusions involve highly amplified clustered breakpoints, indicating possible positive selection and
therefore functional significance. This is consistent with the fact that firestorm patterns indicate poor
prognosis (Hicks et al. 2006) and that these highly amplified regions contain specific prognostic markers
(Jonsson et al. 2007). However, not all the amplified loci contain oncogenes. Analysis and results below
indicate that the oncogenic effects of some of the fusions may in fact be due to a suppression of normal
function of a tumor suppressor gene. Observed amplification of gene fusions involving tumor suppressors is
consistent with a dominant-negative effect of such gene fusions.

For example, the first two exons of PTPRG, comprising the carbonic anhydrase-like domain, are replaced by the
first 10 exons of the unannotated inter-species ASTN2 gene. Promoter hypermethylation in PTPRG in T-cell
lymphoma leads to loss of gene expression and correlates with poor prognosis (van Doorn et al. 2005).
Interestingly, murine L cells producing PTPRG transcripts with a homozygous deletion of the carbonic
anhydrase-like domain cause sarcomas in syngeneic mice (Wary et al. 1993).

To examine the effects of a possible suppression of SULF2 function by the ARFGEF2-SULF2 fusion, SULF2 mRNA was
knocked down using siRNA specifically targeting SULF2 in MCF-7B, MDA MB231, and MCF-10A cells (Supplemental
Fig. 6). Proliferation assays were performed on the three cell lines after SULF2 knock-down, and all exhibited
a proliferative advantage over the cells treated with control siRNA (Fig. 5A-C). To determine the effect on
survival capabilities under stress conditions, SULF2 siRNA and control siRNA treated cells were plated in
serum-free conditions. Results indicate (Fig. 5D-F) that cells with knocked-down SULF2 survive better and
recover faster (seen by the steeper slope) in serum-free conditions than the control cells. This implies that
knock-down of SULF2 enhances survival compared to the control cells. Finally, knock-down of SULF2 mRNA caused a
twofold increase in anchorage-independent growth in MCF-7B and a threefold increase in MDA MB231, as measured
by the number of colonies compared with controls (Fig. 5H). In summary, the data indicate that knock-down of
SULF2 causes tumorigenic phenotypes, including increased proliferation, enhanced survival, and increased
anchorage-independent growth. SULF2 may therefore act as a breast cancer suppressor.

Figure 3. (A) Four clusters of breakpoints at 1p13.1-p21.1, 3p14.1-p14.2, 17q22-q24.3, and 20q12-q13.33.
(B) Low copy repeat (LCR) association with clustered and dispersed breakpoints. (C) The four clusters of
breakpoints correspond exactly to the four highly amplified regions in MCF-7, as determined by array CGH.



Some  genes  are  involved  in  numerous  rearrangements 

In addition to the 10 gene-gene fusions, a total of 77 genes were otherwise affected by the 157 breakpoints.
We jointly refer to those events as "truncations" even though some, in fact, involve fusion of an upstream
promoter with a protein coding gene. PTPRG and other genes were affected by multiple breakpoints, including
both fusion breakpoints and truncation breakpoints. The PTPRG breakpoints occur within the chromosome 3
breakpoint cluster and coincide with a known fragile site. Another example is the fusion of the BMP7 promoter
upstream of ZNF217, a breast cancer oncogene overexpressed in breast cancer (Collins et al. 2001), which we
rediscovered and which was previously described by Volik et al. (2003, 2006). The chromosome 20 rearrangement
hotspot contains 37 breakpoints surrounding the ZNF217 oncogene. Another extreme example of multiple
rearrangements is the breast cancer amplified sequence 3 (BCAS3) gene, located within the chromosome 17
rearrangement hotspot. There are seven breakpoints located within the intron-exon boundaries and an additional
19 nonfusion breakpoints surrounding the BCAS3 gene region.

Table 1. Gene fusions in MCF-7 that involve splicing of intact coding exons

Associated genes | Rearrangement type | Cytoband translocation | Comment
ARFGEF2-SULF2 | Intrachromosomal inversion | 20q13.13-20q13.13 | Fusion of ARFGEF2 exon 1 to SULF2 exons 3-21; 1.2-Mb inversion
DEPDC1B-ELOVL7 | Intrachromosomal translocation | 5q12.1-5q12.1 | Fusion of DEPDC1B N terminus exons 1-7 (out of 11) with ELOVL7 exons 8-9
RAD51C-ATXN7 | Interchromosomal rearrangement | 3p14.1-17q22 | Fusion of RAD51C exons 1-7 (out of nine) with ATXN7 exons 6-13
SULF2-PRICKLE2 | Interchromosomal rearrangement | 3p14.1-20q13.13 | Fusion of SULF2 exon 1 with last exon of PRICKLE2
NPEPPS-USP32 | Intrachromosomal inversion | 17q21.32-17q23.2 | Fusion of NPEPPS exons 1-12 (out of 23) with USP32 exons 2-4; 13-Mb inversion
ASTN2-PTPRG | Interchromosomal rearrangement | 3p14.2-9q33.1 | Fusion of ASTN2 exons 1-10 (out of 22) with PTPRG exons 3-30
BCAS3-BCAS4 | Interchromosomal rearrangement | 17q23.2-20q13.13 | BCAS4 exon 1 fused to BCAS3 exons 23-24; also found by Ruan et al. (2007)
BCAS3-RSBN1 | Interchromosomal rearrangement | 1p13.2-17q23.2 | Fusion of RSBN1 first exon with BCAS3 exons 6-24
ASTN2-TBC1D16 | Interchromosomal rearrangement | 9q33.1-17q25.3 | Fusion of ASTN2 exons 1-15 with TBC1D16 exons 2-12
BCAS4-PRKCBP1 | Intrachromosomal inversion | 20q13.12-20q13.13 | Fusion of BCAS4 exon 1 with PRKCBP1 exons 5-22; 3.5-Mb inversion

Figure 4. Confirmation of the presence of predicted processed chimeric mRNA transcripts in MCF-7 using RT-PCR.

the microarray expression data set involving 50 breast cancer cell lines reported by Neve et al. (2006) and
found that RAD51C levels are elevated in MCF-7, but much lower or absent in the majority of the other breast
cancer cell lines.

We identified a translocation in another gene involved in DSBR, BRCA1-interacting protein-1 (BRIP1, also termed
BACH1). BRIP1 was originally identified as a helicase-like protein that interacts directly with BRCA1 and
contributes to its DNA repair function. BRIP1 binds to the BRCT repeat in BRCA1. The C terminus of BRIP1 is
critical for its interaction with BRCA1, and a truncation mutant has been shown to block DSBR (Cantor et al.
2001; Yu et al. 2003; Lewis et al. 2005). Importantly, germline truncation mutations of BRIP1 have been
identified in familial breast cancer without mutations of BRCA1/2, and BRIP1 truncations confer a twofold
increased risk of developing breast cancer. We identified a translocation that results in the loss of the last
three exons (exons 18-20); however, the fused DNA (3p14) downstream of BRIP1 does not contain any exons or
introns. The truncation at exon 17 of BRIP1 would eliminate the C-terminal third of BRIP1 and eliminate binding
to BRCA1. However, it is unclear at present whether the truncated mRNA would be stable as there is no
transcription stop site or polyA tail.

Rearrangements  affect  genes  involved  in  homologous 
double-stranded  break  repair 

We identified rearrangements in genes that code for members of
protein complexes involved in double-stranded break repair
(DSBR), raising the possibility that defects in DSBR genes may have
contributed to genomic instability at certain stages of the evolution
of the MCF-7 genome. One of the four MCF-7 gene fusions
that produced a detectable predicted chimeric transcript is an
interchromosomal fusion of RAD51C exons 1-7 to the neuronal-specific
gene ATXN7 exons 6-13. RAD51C is a paralog of RAD51,
a gene central to DNA DSBR. RAD51C is an essential component of
a complex reported to be involved in resolving Holliday junctions
(HJs) formed during DSBR (Liu et al. 2007) and as such is integral to
the maintenance of genomic stability. The translocation we have
identified eliminates the domain of RAD51C that binds other
family member homologs such as RAD51D and XRCC3 (Miller et al.
2004), possibly disrupting formation of the complex responsible for
resolving HJs.

RAD51C is located at 17q23, a region of amplification that
has been extensively studied in MCF-7 cells and breast cancer. One
of the most studied oncogenes in breast cancer, ErbB2, is in close
proximity to the 17q21.2 locus, which is amplified in a number of
breast cancers (but not in MCF-7) but often independently of the
17q23 amplification. We examined RAD51C expression level in


Discussion 

We have completed a sequence-level survey of rearrangements in
a cancer genome. One major insight gained from this analysis is
the presence of two types of breakpoints, clustered and dispersed,
the latter being associated with LCRs. While we have not
encountered previous reports of genome-wide association of LCRs
with DSBs and chromosomal instability in tumors, the role
of LCRs in promoting double-strand breaks through the replication
fork stalling mechanism has recently been proposed in the
context of genomic disorders (Lee et al. 2007).

A second major insight is that the two diverse types of
breakpoints may have arisen during different stages of the evolution
of the MCF-7 genome. Volik et al. (2006) hypothesized that
20q telomere loss initiated BFB cycles and a cascade of amplification
resulting in small, highly rearranged hotspots that colocalize
DNA from different genomic regions. Our results show the same
chromosomal rearrangement architecture, albeit at higher resolution,
and are consistent with the hypothesis that BFB cycles,
possibly including extrachromosomal amplisomes, played an
initial role in MCF-7 genome evolution. The chromosome 3
rearrangement hotspot encompasses the common fragile site FRA3B,
which is prone to chromosomal instability and is a mediator of
recurrent BFB amplification found in a variety of human tumors
(Hellman et al. 2002). Recurrent breaks within common fragile sites
propagated via BFB cycles amplify oncogenes and promote tumorigenesis
(Huebner and Croce 2001; Hellman et al. 2002). Since both the
RAD51C-ATXN7 fusion and the BRIP1 truncation belong to clusters
possibly generated by the BFB mechanism, a possible effect is
failure of the HR mechanism of DSBR and a consequent switch to
NHEJ repair at stalled replication forks. A similar previously
observed precedent is the switch from HR to NHEJ in RAD54
homolog mutants (Sonoda et al. 2006). The switch to NHEJ at some
point in the evolution of MCF-7 would have resulted in a mutator
phenotype (Loeb 2001) and the pattern of extensive chromosomal
rearrangements observed in MCF-7.

The  switch  to  the  rearrangement-creating  NHEJ  would  have 
exposed  the  most  breakage-prone  sites — those  containing  LCRs — by 
converting  simple  replication-associated  breaks  into  detectable 
Figure 5. (A-C) Cells treated with SULF2 siRNA have enhanced proliferation compared with cells treated with control siRNA. MCF-7B (A; Mao et al.
2005), MDA-MB-231 (B), and MCF-10A (C) cells were transfected with 50 nM SULF2 or control siRNA; 10^4 cells were plated in medium containing 10%
FBS 48 h after transfection of the siRNA. Cells were counted on days 2, 4, 6, and 8. Experiments were performed in triplicate; error bars show standard deviation.
(D-F) Cells treated with SULF2 siRNA have enhanced survival compared with cells treated with control siRNA. MCF-7B (D; Mao et al. 2005),
MDA-MB-231 (E), and MCF-10A (F) cells were transfected with 50 nM SULF2 or control siRNA; 10^4 cells were plated in serum-free medium 48 h after
transfection of the siRNA. Cells were counted on days 2, 4, and 6. Experiments were performed in triplicate. Error bars, SD. (G,H) Treatment of MCF-7B and
MDA-MB-231 cells with siRNA for SULF2 increases anchorage-independent growth. After treatment with siRNA, 10^4 cells were plated in 0.3%
agar in growth medium; MCF-7B colony formation is shown in G. Plates were incubated for 21 d, and colonies were counted; bar chart results are shown in H.
Experiments were performed in triplicate. Error bars, SD.


rearrangements. An analogy here exists between LCRs and DSB
repair on one hand and microsatellites and mismatch repair on
the other (Lengauer et al. 1998): by presenting challenges to DNA
replication, LCRs and microsatellites expose weaknesses in DSB
repair and mismatch repair mechanisms, respectively. We should
note that our extensive sequencing did not indicate increased
mutability of MCF-7 at the base pair level, which suggests highly
functional mismatch repair.

The two-stage model also accounts for the typical curve describing
the increase in genome complexity during the evolution
of a breast cancer genome (Chin et al. 2004). While BFB cycles
may account for the steep rise in genomic complexity in
MCF-7 during the stage of in situ carcinoma and telomere crisis, the
subsequent instability mediated by the failure of the homology-based
DSB repair mechanism, resulting in breaks at LCR loci, may
account for the subsequent less steep slope that typically follows
completion of the telomere crisis stage and accompanies metastasis.
The two-stage model is also consistent with ongoing plasticity of
the MCF-7 genome as evidenced by polyclonality and divergence of
MCF-7 sublines (Jones et al. 2000; Nugoli et al. 2003).

The third insight is the abundance of genes affected by
rearrangements, and particularly of gene fusions, which exceeds
current estimates of the abundance of gene fusions in breast cancer
(Mitelman et al. 2007). Our unbiased screen of MCF-7 cell lines
identified 79 genes involved in rearrangement events.
Ten gene fusions were identified, nine novel and one previously
reported by Ruan et al. (2007), and 77 other fusions involving
genes and gene truncations.

The fourth insight is that at least a fraction of genes affected
by fusions and truncations may in fact be tumor suppressors (e.g.,
PTPRG, SULF2) or may be responsible for genome stability (e.g.,
RAD51C, BRIP1). Both BRIP1 and RAD51C fall within the cluster of
breakpoints at 17q23 and are amplified in MCF-7 cells, indicating
possible positive selection for the amplification. Such positive
selection would be consistent with previously reported
dominant-negative effects observed in genes responsible for genome stability
(Milne and Weaver 1993).

The fifth insight is that chimeric transcripts can in fact be
discovered by directly mapping rearrangements at the level of
genomic DNA and then predicting specific chimeric transcripts.
This opens the possibility of discovering recurrent, mechanistically
and prognostically significant rearrangements by simply
mapping a sufficient number of genomes and directly observing
recurrent events.
In conclusion, this study validates the utility of mapping
rearrangements in cancer genomes by providing mechanistically
significant insights into cancer evolution and identifying genes
likely involved in cancer progression. Building on the benchmarks
developed in this study, next steps include technological and
methodological improvements that will allow scale-up to whole
genomes and to multiple cell lines and tumor samples at a more
affordable cost, thus broadening applications in the research
context and eventually in clinical settings.

Methods 

Fosmid  library  preparation  and  end-sequencing  of  clone  inserts 

Fosmid libraries were prepared from each of the six 96-BAC pools
indicated in Figure 1B using the Epicentre EpiFOS Fosmid Library
Production Kit.

DNA  sequencing 

The ends of fosmid inserts were obtained using Sanger-based
sequencing on an ABI 3730XL. Approximately 300,000 short (100-bp)
reads were obtained from each of the three 192-BAC pools
indicated in Figure 1B using the 454 Life Sciences (Roche) GS
machine. Detailed sequencing statistics are included in
Supplemental Table 1. The sequencing reads are available for
download from the public project pages at http://www.genboree.org.

Mapping  reads  onto  the  reference  genome 

Fosmid-end  reads,  454  Life  Sciences  (Roche)  shotgun  reads,  and 
BAC-end  reads  were  mapped  onto  the  reference  human  genome 
(March  2006  assembly,  Build  36)  using  the  BLAT  program.  BLAT 
parameters  used  for  mapping  are  described  in  Supplementary 
Materials  and  coordinates  are  available  through  the  Genboree  site 
on  the  Breast  Cancer  project  page  at  http://www.genboree.org. 

PCR  primer  design  pipeline 

PCR primers were designed for amplifying breakpoint regions using
the repeat-masked human genome assembly (March 2006 assembly,
Build 36) and a semi-automated primer design pipeline.
The Primer3 primer design program was run to obtain a set of nested
primers using two categories of parameters, "stringent" and
"relaxed." Primer pairs in each category were scored, and the
highest-scored primer pair was selected for the initial round of PCR
amplification. Priority was given to the stringent category. In case of
failure, additional lower-scoring primer pairs were employed.
More details, including Primer3 parameters, can be found in
Supplemental materials.
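
The selection order described above can be sketched as a simple sort. This is only an illustration under stated assumptions, not the pipeline's actual code: the `PrimerPair` record, its field names, and the example scores are hypothetical, and a higher score is assumed to be better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrimerPair:
    name: str      # hypothetical identifier for the primer pair
    category: str  # "stringent" or "relaxed"
    score: float   # assumed convention: higher is better


def order_primer_pairs(pairs):
    """Return pairs in the order they would be tried: stringent category
    first, then by descending score within each category."""
    rank = {"stringent": 0, "relaxed": 1}
    return sorted(pairs, key=lambda p: (rank[p.category], -p.score))


candidates = [
    PrimerPair("r1", "relaxed", 0.95),
    PrimerPair("s2", "stringent", 0.70),
    PrimerPair("s1", "stringent", 0.90),
]
attempt_order = [p.name for p in order_primer_pairs(candidates)]
```

The first element is the pair used for the initial PCR round; later elements are the fallbacks used after an amplification failure.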

PCR  amplification  of  genomic  DNA  from  cell  lines 

Breakpoint  confirmation  included  PCR  amplification  of  a  pool  of 
genomic  DNA  from  six  different  sublines  of  MCF-7  cells  (B,  BK,  C, 
D, L, and Neo). DNA isolated from immortalized but nontransformed
mammary epithelial cells (MCF-10A) and normal human
female  DNA  (Novagen)  were  used  as  negative  controls.  Genomic 
cell  line  DNA  was  isolated  with  the  DNeasy  kit  (Qiagen).  PCR 
bands  were  visualized  on  a  2%  agarose  gel. 

Breakpoint  clustering  algorithm 

Consecutive breakpoints that are closer than 1.1 Mbp in the
reference genome assembly were connected. Runs of consecutive
connected breakpoints with eight or more members are declared
to constitute a cluster. Four clusters on chromosomes 1, 3, 17, and
20 indicated in Figure 3 were obtained in this fashion.
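
A minimal sketch of this clustering rule follows. It assumes breakpoints are given as (chromosome, position) pairs in reference coordinates; the function name and the example coordinates are illustrative and do not come from the study.

```python
def cluster_breakpoints(breakpoints, max_gap=1_100_000, min_size=8):
    """Connect consecutive breakpoints on the same chromosome that lie
    closer than max_gap, and report runs with at least min_size members."""
    by_chrom = {}
    for chrom, pos in breakpoints:
        by_chrom.setdefault(chrom, []).append(pos)

    clusters = []
    for chrom in sorted(by_chrom):
        positions = sorted(by_chrom[chrom])
        run = [positions[0]]
        for pos in positions[1:]:
            if pos - run[-1] < max_gap:
                run.append(pos)          # still connected to the current run
            else:
                if len(run) >= min_size:
                    clusters.append((chrom, run))
                run = [pos]              # start a new run
        if len(run) >= min_size:
            clusters.append((chrom, run))
    return clusters


# Illustrative input: eight breakpoints spaced 1 Mbp apart on chr17,
# plus three scattered breakpoints on chr1.
example = [("chr17", 50_000_000 + i * 1_000_000) for i in range(8)]
example += [("chr1", 1_000_000), ("chr1", 1_500_000), ("chr1", 9_000_000)]
clusters = cluster_breakpoints(example)
```

With these inputs only the chr17 run qualifies; the chr1 breakpoints form runs that are too short to be declared a cluster.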

Identification  of  LCR  regions 

Each  of  the  157  MCF-7  breakpoints  was  examined  for  the  presence 
of  LCR.  Intrachromosomal  and  interchromosomal  LCRs  were 
detected  by  applying  a  novel  algorithmic  method  to  the  human 
genome  assembly  (March  2006  assembly,  Build  36).  The  method 
involved  self-comparison  of  the  human  genome  using  the  Pash 
program (Kalafus et al. 2004) and an automated pipeline for
segmentation, clustering, and parsing of LCRs based on sequence
feature analysis. The LCRs detected by this method cover 6.15% of
the  whole  genome  in  length,  of  which  18.7%  are  gene-containing 
regions.  A  detailed  description  of  the  algorithm  is  available  in 
Supplemental  materials. 

Analysis  of  recurrent  copy  number  changes  in  157  somatic 
breakpoint  loci 

Copy number variation in the 157 somatic breakpoint loci
identified in this study was examined. In order to identify recurrent
copy number changes in breakpoint loci, array CGH data from 201
breast cancer cell lines and tumors (Chin et al. 2006; Neve et al.
2006; Shadeo and Lam 2006; Jonsson et al. 2007) were integrated.
A locus was declared recurrently amplified if amplification was
reported in more than 20% of cases for the specific locus. Detailed
results are compiled in a table where breakpoints are sorted by
their level of recurrent copy number amplification (for details, see
Supplemental materials and Supplemental Table 3).
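
The recurrence criterion can be illustrated with a short sketch. It assumes the array CGH data have already been reduced to one amplified/not-amplified call per sample and locus; the locus names and calls below are invented for illustration only.

```python
def recurrent_amplification(calls, threshold=0.20):
    """calls maps a locus name to a list of booleans, one per sample
    (True = amplification reported). Returns (locus, fraction, recurrent)
    rows sorted by decreasing fraction of amplified samples."""
    summary = []
    for locus, flags in calls.items():
        fraction = sum(flags) / len(flags)
        # "more than 20% of cases" is a strict inequality
        summary.append((locus, fraction, fraction > threshold))
    summary.sort(key=lambda row: row[1], reverse=True)
    return summary


example_calls = {
    "locus_A": [True, True, False, False, False],    # 2 of 5 samples
    "locus_B": [True, False, False, False, False],   # 1 of 5 samples
    "locus_C": [False, False, False, False, False],  # 0 of 5 samples
}
summary = recurrent_amplification(example_calls)
```

Here only `locus_A` is flagged: exactly 20% of samples (as in `locus_B`) does not satisfy the "more than 20%" rule.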

Analysis  of  recurrent  expression  and  copy  number  changes 
in  79  breakpoint-associated  genes 

Patterns  of  recurrent  copy  number  and  expression  level  variation 
were  examined  for  79  genes  associated  with  the  157  somatic 
breakpoints  identified  in  this  study.  Expression  data  from  50 
breast  cancer  cell  lines  (Neve  et  al.  2006)  were  combined  with  copy 
number  data  from  201  breast  cancer  cell  lines  and  tumors  (Chin 
et  al.  2006;  Neve  et  al.  2006;  Shadeo  and  Lam  2006;  Jonsson  et  al. 
2007). Detailed results are compiled in a table where genes are
sorted by their level of recurrent alteration (for details, see
Supplemental Materials and Supplemental Table 2). Additionally,
copy  number  data  from  an  Affymetrix  100k  SNP  chip  were  used  to 
identify  breakpoint  genes  that  also  associate  with  regions  of  copy 
number  alteration  (see  Supplemental  Table  3). 

Detection  of  predicted  fusion  transcripts  by  RT-PCR 

mRNA from exponentially growing MCF-7 and MCF-10A cells
was isolated with the RNeasy kit (Qiagen). To determine the
presence of a fusion transcript, primers were designed across the
fusion point on cDNA using Primer3. Control primers were
designed on either side of the fusion. cDNA was generated by using
gene-specific primers. PCR amplification of the cDNA was
restricted to 35 cycles. PCR bands were visualized on a 2% agarose
gel and verified by sequencing to confirm that the product
contained sequence from both genes involved.

Cell  growth  and  soft-agar  experiments 

For the cell growth experiments, 10,000 cells were plated in
triplicate in 24-well plates. The cells were grown in growth medium,
containing 10% FBS, or in serum-free medium. Growth rate was
measured on days 0, 2, 4, and 6 with a Coulter Counter (Beckman
Coulter).

Colony growth assays were performed as follows: 1 mL of
a solution of 0.5% noble agar in growth or serum-free medium was
layered onto 30 × 10-mm tissue culture plates. A total of 1 × 10^4
cells was mixed with 1 mL of 0.3% agar solution prepared in
a similar manner and layered on top of the 0.5% agar layer. Plates
were incubated at 37°C in 5% CO2 for 21 d. The experiment was
performed in triplicate.

Knock-down  of  SULF2  using  short  interfering  RNA  (siRNA) 

Transfections with SULF2 and control nonspecific siRNA
(Dharmacon) were carried out using 50 nM pooled siRNA duplexes
and 4 µL of Dharmafect (Dharmacon) in six-well plates according
to the manufacturer's protocol. After 48 h, the cells were prepared
for the respective assays.
Cancer genomes frequently undergo genomic instability resulting in accumulation of chromosomal
rearrangements. To date, one of the main challenges has been to confidently and accurately
identify these rearrangements by using short-read massively parallel sequencing. We
were able to improve cancer rearrangement detection by combining two distinct massively
parallel sequencing strategies: fosmid-sized (36 kb on average) and standard 5 kb mate pair
libraries. We applied this combined strategy to map rearrangements in two breast cancer cell
lines, MCF7 and HCC1954. We detected and validated a total of 91 somatic rearrangements
in MCF7 and 25 in HCC1954, including genomic alterations corresponding to previously reported
transcript aberrations in these two cell lines. Each of the genomes contains two types of breakpoints:
clustered and dispersed. In both cell lines, the dispersed breakpoints show enrichment for
low copy repeats, while the clustered breakpoints associate with high copy number amplifications.
Comparing the two genomes, we observed highly similar structural mutational spectra
affecting different sets of genes, pointing to similar histories of genomic instability against the
background of very different gene network perturbations.

Keywords  Copy  number  variation,  fosmid  diTag,  gene  fusion,  genomic  instability,  massively 
parallel  sequencing 
End sequence profiling of clonal libraries has been used
extensively to discover structural variation in both normal and
cancer genomes (1-5). Recently, the adoption of massively
parallel sequencing has supplemented structural variation
detection by identifying rearrangements at fine-scale resolution
for both normal and cancer genomes (6-14). These
massively parallel sequencing studies have significantly
added to the catalog of genomic rearrangements, but the
limited insert sizes between paired ends have provided less
power than larger insert clones to map across duplications
and repeat-rich regions in the genome, thereby missing
a large fraction of variation (2,15). Massively parallel mate pair
sequencing is also hindered by high false-positive rearrangement
discovery rates, requiring additional breakpoint
validation. Commonly used techniques for breakpoint validation
include optical mapping based on restriction enzyme
maps or incorporated fluorochrome-labeled nucleotides (2,16),
hybridization of fluorescent probes that span rearrangements
(5,17), and polymerase chain reaction amplification across
aberrant fusions followed by Sanger-based sequencing of the
breakpoint amplicons (1,7,11,12,14). Although these validation
techniques offer proof of genomic rearrangement, none
are currently amenable to high-throughput workflows.

We supplement the limited insert size of standard
massively parallel mate pair sequencing by incorporating
fosmid-sized insert libraries, thereby providing additional
validation of detected rearrangements. Our fosmid-sized
mate pair libraries, called fosmid diTags, leverage the
affordable costs and high-throughput capacity of massively
parallel sequencing while providing clone-sized inserts able
to span long, repetitive sequence elements. Fosmid diTags
are well suited for rearrangement detection either in
stand-alone or complementary fashion with other mate pair
libraries; fosmid diTags are also advantageous for de novo
genome assembly where larger insert size facilitates greater
continuity (18). Fosmid diTags are an extension of paired
end tag methods (14,19-21), where short paired tags from
the ends of DNA fragments are enzymatically extracted and
covalently linked as diTag constructs for high-throughput
sequencing. Fosmid diTag workflow details are provided in
Supplemental Figures 1 and 2.

Materials  and  methods 

Sequencing  library  preparation 

Paired end sequencing methods exploit the fact that structural
abnormalities consist of two chromosomal segments
that  are  in  a  relative  position  and  orientation,  or  at  a  relative 
distance  that  is  not  consistent  with  the  reference  genome 
assembly.  Construction  of  paired  end  sequencing  libraries 
that  adequately  cover  the  genome  of  interest  allows  for 
comprehensive  identification  of  structural  abnormalities. 

A total of 1.55 million MCF7 (ATCC [American Type
Culture Collection, Manassas, VA] HTB-22) and 1.50 million
HCC1954 (ATCC CRL-2338) fosmids were cloned by using
the novel pFosDT1.2 vector (derived from the Epicentre
pCC1FOS plasmid). The pFosDT1.2 vector contains two
EcoP15I restriction sites that flank the site of insertion.
EcoP15I, a type III restriction endonuclease, cuts 25 and 27
bp downstream of its recognition site, producing a 2 bp 5'
overhang, and requires two separated and inversely oriented
recognition sites in supercoiled DNA for native cleavage.
Addition of sinefungin in the EcoP15I digest reaction facilitates
cleavage at all recognition sites independent of DNA
topology (22). Starting with 10 µg of purified pooled fosmid
DNA from each breast cancer cell line, two independent
long-range clonal insert fosmid diTag massively parallel
sequencing libraries were produced. For each fosmid library,
26 bp end tags from the insert termini were isolated and
concatenated (Supplemental Figure 2).

Illumina mate pair whole-genome shotgun libraries, of
insert sizes ranging from 4 to 6 kb, were additionally
constructed with 10 µg of genomic DNA from each of the MCF7
(ATCC HTB-22) and HCC1954 (ATCC CRL-2338) cell lines.
Mate pair libraries were prepared according to the
manufacturer's instructions (Illumina PE-112-1002). Two separate
MCF7 mate pair libraries with 4 kb and 6 kb inserts were
constructed, and a single HCC1954 mate pair library with 5
kb inserts was constructed.

The  fosmid  diTag  and  Illumina  mate  pair  libraries  were 
sequenced  on  an  Illumina  Genome  Analyzer  II  massively 
parallel  sequencing  system  following  the  manufacturer’s 
instructions.  Raw  sequence  data  for  the  fosmid  diTag  and 
standard  Illumina  mate  pair  libraries  are  available  online 
(http://www.genboree.org/breastCellLineReads/). 


Mapping  to  reference  genome 

Novoalign V2.05.02 (Novocraft) was used to align quality-filtered
paired end reads to the reference human genome (March 2006
assembly, National Center for Biotechnology Information
[NCBI] build 36.1, University of California-Santa Cruz
[UCSC] build HG18). Novoalign parameters used for
mapping are described in the Supplementary Materials, and
mapping coordinates are available for viewing and download
through the Genboree open-hosting genome browser
(http://www.genboree.org/).


Structural  rearrangement  calling 

Fosmid diTags and Illumina mate pair sequences that align
discordantly were used to call putative structural rearrangements.
The combined fosmid diTag and standard Illumina
mate pair structural rearrangements that validated as
cancer-specific somatic mutations are available in Supplemental
Table 1.

Determining  structural  variants  from  Illumina  mate  pair 
and  fosmid  diTag  sequences  is  complicated  by  two  factors: 
the  contamination  of  inward-facing  reads  and  the  formation 
of  chimeric  clones,  respectively.  Inward-facing  reads  are 
paired  end  sequences  from  a  contiguous  piece  of  DNA  sized 
equal  to  the  final  sequencing  library  length,  approximately 
400 bp. Formation of chimeric clones during the fosmid diTag
procedure introduces false information about the distance and
orientation between two reads, complicating structural
variant calling.

False-positive breakpoints called by inward-facing reads
were removed before reporting structural variants by filtering
the discordant read clusters. Inward-facing read clusters
were filtered on the basis of size and their inward-facing read
orientation. Because inward-facing read clusters are limited
by the final sequencing library length, they are easily identified
and may be removed from further analysis (see
Supplementary Materials for filter parameters). However,
inward-facing read clusters that span true-positive
rearrangements will also be removed, thereby introducing
detection voids. To overcome fosmid diTag chimera noise,
discordant tags supporting the same structural variant were
clustered. Clusters are formed if there are at least two
uniquely mapping paired end signatures with corroborating
genomic positions, sizes, and read orientations. Such
a strategy is called standard clustering, and it is commonly
used (6,8,10,23-25).
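
The idea behind standard clustering can be sketched as follows. This is a simplified illustration rather than the procedure used in the study: the tuple layout, the greedy seed-based grouping, and the 40 kb window are assumptions chosen for the example, and the actual filter parameters are those given in the Supplementary Materials.

```python
from collections import defaultdict


def cluster_discordant_pairs(pairs, window=40_000, min_support=2):
    """pairs: iterable of (chrom1, pos1, strand1, chrom2, pos2, strand2)
    for uniquely mapped discordant pairs. Pairs corroborate each other when
    they share chromosomes and strands and both ends fall within `window`
    bp of the first pair of a cluster. Clusters with fewer than
    `min_support` members (for example, isolated chimeras) are dropped."""
    groups = defaultdict(list)
    for chrom1, pos1, strand1, chrom2, pos2, strand2 in pairs:
        groups[(chrom1, strand1, chrom2, strand2)].append((pos1, pos2))

    supported = []
    for key, members in groups.items():
        open_clusters = []
        for pos1, pos2 in sorted(members):
            for cluster in open_clusters:
                seed1, seed2 = cluster[0]
                if abs(pos1 - seed1) <= window and abs(pos2 - seed2) <= window:
                    cluster.append((pos1, pos2))
                    break
            else:
                open_clusters.append([(pos1, pos2)])
        supported.extend((key, c) for c in open_clusters if len(c) >= min_support)
    return supported


example_pairs = [
    ("chr1", 1_000_000, "+", "chr17", 55_000_000, "-"),
    ("chr1", 1_010_000, "+", "chr17", 55_020_000, "-"),
    ("chr3", 60_000_000, "+", "chr20", 45_000_000, "+"),  # unsupported singleton
]
supported = cluster_discordant_pairs(example_pairs)
```

The two chr1/chr17 pairs support one putative rearrangement, while the single chr3/chr20 pair is discarded as possible chimera noise.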


Breakpoint spanning polymerase chain reaction
primer design pipeline

Polymerase chain reaction (PCR) primers were designed for
amplification across aberrant fusions by using the human
reference genome (March 2006 assembly, NCBI build 36.1,
UCSC build HG18). The Primer3 primer design algorithm
was used to obtain a set of nested primers with two
categories of parameters, stringent and relaxed. Primer pairs in
each category were scored, and the highest scoring primer
pair was selected for PCR assay validation. Priority was given to
the stringent category using the repeat-masked human
reference genome. In cases of PCR amplification failure,
additional lower scoring primer pairs were utilized. More
details, including Primer3 parameters, can be found in the
Supplementary Materials; the automated primer design
pipeline code is available for download
(http://github.com/oliverhampton/Breakpoint-Primer-Design).

Copy  number  variation  calling 

Uniquely mapping reads were used as input for the
readDepth R package (26), which calls copy number alterations
by evaluating depth of sequence coverage. The package's
default parameters were used, including an overdispersion
value of 3 and a false discovery rate of 0.01. The readDepth
package also provided a breakpoint refinement tool that
allowed us to adjust copy number segment ends to matched
breakpoint positions.

Breakpoint  clustering  algorithm 

Combined  fosmid-sized  and  5  kb  breakpoints  that  map 
within  2  Mb  in  MCF7  and  5  Mb  in  HCC1954  were  clustered. 
Chromosome  segment  annotations  were  retained  if  five  or 
more  breakpoints  in  MCF7  or  two  or  more  breakpoints  in 
HCC1954  were  contained  within  the  cluster.  For  each  set  of 
MCF7 and HCC1954 breakpoint clusters, the cluster containing
the highest number of breakpoints served as a seed
for  a  connected  graph  (or  clique)  where  the  chromosome 
segments  are  nodes  and  spanning  breakpoints  are  edges. 
In  this  manner,  cliques  of  four  breakpoint  clusters  in  MCF7 
on  chromosomes  1,  3,  17,  and  20,  and  five  breakpoint 
clusters  in  HCC1954  on  chromosomes  5,  8,  and  11  were 
constructed. 
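
The graph construction can be sketched as a breadth-first search from the seed segment. This is an assumption-laden illustration: segment labels and edges are invented, the seed is approximated as the segment touched by the most spanning rearrangements, and the result is a connected component, which is looser than a clique in the strict graph-theoretic sense.

```python
from collections import defaultdict, deque


def seeded_component(edges):
    """edges: list of (segment_a, segment_b), one per rearrangement that
    spans two clustered chromosome segments. Returns the seed segment and
    the set of segments reachable from it."""
    neighbours = defaultdict(set)
    degree = defaultdict(int)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
        degree[a] += 1
        degree[b] += 1

    # Seed: segment with the most spanning breakpoints (ties broken by name).
    seed = max(sorted(degree), key=degree.get)

    seen = {seed}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seed, seen


example_edges = [
    ("chr17", "chr20"),
    ("chr17", "chr1"),
    ("chr17", "chr3"),
    ("chr20", "chr1"),
    ("chr5", "chr8"),  # not connected to the seed
]
seed, component = seeded_component(example_edges)
```

In this toy input the seed is the chr17 segment, and the component links the chr1, chr3, chr17, and chr20 segments while leaving the unconnected chr5/chr8 pair out.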

Identification  of  low  copy  repeat  regions 

Each of the fosmid diTag and Illumina mate pair breakpoints
from the MCF7 and HCC1954 breast cancer cell lines was
examined for the presence of low copy repeats (LCRs). Intra-
and interchromosomal homologous LCRs were detected by
applying a novel algorithmic method to the human reference
genome sequence (March 2006 assembly, NCBI build 36.1,
UCSC build HG18). The method achieved higher sensitivity
than previously applied methods (27) by using k-mer
frequency sequence information to detect, parse, and cluster
LCRs without removing high copy number repetitive
elements (repeat masking). The LCRs detected by this
method covered 6% of the whole genome in length, of which
19% were gene-containing regions. A detailed description of
the algorithm is available in the Supplementary Materials.

PCR  amplification  of  genomic  DNA  from  cell  lines 

Breast cancer cell line breakpoint confirmation used PCR
amplification of MCF7 (ATCC HTB-22) and a pool of negative
control genomic DNA isolated from human female
(Novagen 70605-3) and from two different cell lines,
MCF10A (ATCC CRL-10317) and HCC1599-BL (ATCC
CRL-2332); and PCR amplification of genomic DNA from
HCC1954 (ATCC CRL-2338) and negative control
HCC1954-BL (ATCC CRL-2339) cell lines. Genomic cell line
DNA was isolated with the DNeasy kit (Qiagen, Valencia,
CA) according to the manufacturer's instructions. PCR bands
were visualized on a 2% agarose gel.

Results 

Combining  fosmid  diTag  and  5  kb  mate  pair 
sequencing  libraries  increases  specificity  to 
detect  chromosomal  rearrangements 

The Illumina standard mate pair libraries, with an average 5 kb
insert size, generated 2.9 and 1.9 Gb of sequence data for
MCF7 and HCC1954, respectively. Upon mapping to the
reference genome, the relatively short distance between the
paired ends was compatible with PCR primer design across
aberrant fusions, and the density of mapped reads allowed for
the measurement of segment copy number. The fosmid diTag
libraries generated 93.3 and 56.9 Mb of sequence data for
MCF7 and HCC1954, respectively. Because of their larger
insert size, the mapped fosmid diTags provided nearly the same
percent insert coverage of the reference genome (mean 81%
± 7% SD) as the Illumina mate pair libraries.

Rearrangements were reported where at least two independent
pairs of ends were discordant in their predicted size and/or
orientation. Mate pairs and diTags were considered discordant
when the distance between the mapped ends was in excess of
2 standard deviations from the insert mean. Discordant ends
were clustered on the basis of mapping position and
orientation discrepancies, thereby refining the position of
the detected breakpoint. From the Illumina 5 kb mate pair
libraries, we identified 23,555 putative rearrangements in
MCF7 and 3,824 in HCC1954. Breakpoint-spanning PCR
primers could be designed for 23% of these rearrangements in
MCF7 and 61% in HCC1954. From the fosmid diTag libraries,
we identified 713 putative rearrangements in MCF7 and 345 in
HCC1954; because of the much longer fosmid diTag insert size
and relatively low fold coverage, standard and long-range PCR
primer designs were incompatible.
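The discordance rule and the two-pair support requirement described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in this study: the `Pair` record, the expected strand orientation, the symmetric use of the 2 SD threshold, and the greedy clustering window are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str


def is_discordant(pair, insert_mean, insert_sd, n_sd=2.0, expected=("+", "-")):
    """Flag a pair whose ends disagree with the library in size or orientation."""
    if pair.chrom1 != pair.chrom2:
        return True  # ends on different chromosomes
    if (pair.strand1, pair.strand2) != expected:
        return True  # unexpected orientation (expected value is an assumption)
    span = abs(pair.pos2 - pair.pos1)
    return abs(span - insert_mean) > n_sd * insert_sd


def cluster_discordant(pairs, window, min_support=2):
    """Greedily group discordant pairs whose two ends both map near a seed pair."""
    clusters = []
    for p in sorted(pairs, key=lambda q: (q.chrom1, q.chrom2, q.pos1)):
        for cluster in clusters:
            seed = cluster[0]
            same_layout = ((p.chrom1, p.chrom2, p.strand1, p.strand2)
                           == (seed.chrom1, seed.chrom2, seed.strand1, seed.strand2))
            if (same_layout and abs(p.pos1 - seed.pos1) <= window
                    and abs(p.pos2 - seed.pos2) <= window):
                cluster.append(p)
                break
        else:
            clusters.append([p])
    # keep only candidates supported by at least min_support independent pairs
    return [c for c in clusters if len(c) >= min_support]
```

Requiring two supporting pairs removes isolated chimeric clones, while the clustering window would in practice be chosen relative to the insert size of each library.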

The high percentage of failed PCR primer designs from
the MCF7 mate pair data was due to the increased prevalence
of repetitive sequence elements surrounding aberrant
fusions. Closer inspection of the PCR primer design failure
sites revealed overlap with repeat-masked sections of the
human genome and disproportionate calling of small indels
(2-4 kb) at a rate 10 times higher than expected. We
speculate that MCF7 has unique defects in its DNA repair
pathways, which would explain the imbalance of mutations
between the two cell lines. We have previously shown
RAD51C to be mutated in MCF7 (28); such a mutation could
affect the Holliday junction (HJ) (29) resolution machinery,
causing misrecognition of HJs, cruciforms, and other
homology-driven secondary structures, leading to
double-strand breaks and accumulation of such indels (30).

Breakpoint-spanning primers from the Illumina mate pairs
were applied to their respective breast cancer cell line
genomes and normal controls. In most cases, the PCR assay
failed to produce an amplification product, indicating a high
rate of false-positive rearrangement detection. In the cases
where a breakpoint amplicon was produced, the majority
identified normal structural polymorphisms; only a small
percentage identified breast cancer-specific somatic
mutations (Figure 1B). Interestingly, combining fosmid diTag
and Illumina mate pair data and selecting rearrangements
detected by both methods showed a threefold enrichment for
cancer-specific somatic mutation and a twofold reduction in
false-positive detection when compared to the Illumina mate
pair libraries alone. Combining fosmid-sized and 5 kb mate
pairs provides cross-validation of rearrangement detection;
moreover, the incorporation of longer fosmid-sized inserts
increases specificity for breast cancer-specific somatic
mutation and decreases the reporting of false-positive
rearrangements when compared to the shorter 5 kb inserts
alone. Combining fosmid diTag and 5 kb mate pair libraries,
we identified 309 chromosomal rearrangements in MCF7
and 72 in HCC1954, and designed breakpoint-spanning PCR
primers for approximately 90% of them (Figure 1A). Although
it is desirable to increase the specificity of chromosomal
rearrangement detection, it must be noted that this
improvement comes with a corresponding loss of sensitivity.
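Selecting rearrangements detected by both libraries amounts to intersecting two call sets with a positional tolerance. The sketch below is illustrative only: the tuple layout, the assumption that both call sets order their ends consistently, and the 40 kb default tolerance (chosen here to be on the order of a fosmid insert) are not taken from the study.

```python
def same_event(a, b, tolerance):
    """Compare two calls given as (chrom1, pos1, chrom2, pos2) tuples."""
    return (a[0] == b[0] and a[2] == b[2]
            and abs(a[1] - b[1]) <= tolerance
            and abs(a[3] - b[3]) <= tolerance)


def cross_validated(mate_pair_calls, fosmid_calls, tolerance=40_000):
    """Keep mate pair calls that are also supported by a nearby fosmid diTag call."""
    return [m for m in mate_pair_calls
            if any(same_event(m, f, tolerance) for f in fosmid_calls)]
```

Reporting the mate pair coordinates of each shared event, as done here, keeps the tighter 5 kb breakpoint interval for primer design; this is a design choice of the sketch rather than a documented step of the method.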

Corresponding  genomic  DNA  fusions  exist  for 
upward  of  half  of  the  gene  fusions  and 
truncations  previously  detected  by  transcript 
mapping 

Chimeric gene transcripts have been previously identified in
MCF7 (31,32) and HCC1954 (33) by transcript mapping.
Transcript mapping is analogous to targeted paired end
sequencing; however, instead of investigating aberrant
genomic fusions, chimeric mRNA transcripts are queried.
Transcript mapping delivers a gene-centric view of
rearrangements that encompasses posttranscriptional
modifications, but it cannot detect genomic rearrangements
outside of gene coding regions. We therefore sought to
comprehensively identify rearrangement events at the genomic
DNA level that may have caused chimeric or truncated mRNA
transcripts.

In MCF7, we identified genomic rearrangements corresponding
to 10 of the 19 and 9 of the 30 chimeric mRNA transcripts
reported by Maher et al. (9,31) and Inaki et al. (32),
respectively. These genomic lesions involve oncogenes
(TMEM49), tumor suppressors (SULF2, PTPRG), constituents
of DNA double-strand break repair (RAD51C, BRIP1), and
other genes related to cell cycle, growth, and survival
(RPS6KB1, ELOVL7, ABCA5) (Table 1).


In HCC1954, we identified genomic rearrangements
corresponding to 3 of the 7 chimeric or truncated gene
transcripts reported by Zhao et al. (33). These three gene
truncations (EIF3E, NSD1, PVT1) are implicated in differing
aspects of the pathophysiology of breast and ovarian cancers
and acute myeloid leukemia (Table 1). In addition, we
discovered a novel genomic rearrangement of UIMC1 (RAP80),
a DNA double-stranded break repair accessory protein and
suspected tumor suppressor, resulting in the loss of its last
5 exons (exons 11-15), which would eliminate its DNA
recognition and binding abilities. The fused DNA (8q24.21)
downstream of the UIMC1 breakpoint does not contain any
exons or introns, and it remains unclear whether the
truncated mRNA would be stable, as there is no transcription
stop site or polyA tail.

High-level  amplifications  of  distinct  driver 
oncogenes  in  MCF7  and  HCC1954  are  detected 
from  mapped  read  density 

The luminal-type MCF7 and ERBB2-overexpressing
HCC1954 breast cancer cell lines are both highly amplified
and display complex structural mutability phenotypes,
exhibiting distinct profiles of genome structural
rearrangement and copy number variation. We integrated read
density and breakpoint information from mapped fosmid-sized
and 5 kb mate pair libraries to accurately identify copy
number variation with the readDepth R package (26).
Figures 1A and 2 show the resulting copy number counts.
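The principle behind estimating copy number from mapped read density can be sketched in a few lines. This is not the readDepth R package used in this study, which applies its own modeling and segmentation; the fixed bin size, the median baseline, and the assumed diploid background are simplifications made for illustration.

```python
from statistics import median


def bin_counts(read_positions, chrom_length, bin_size):
    """Count mapped read starts (0-based positions) in fixed-width bins."""
    n_bins = (chrom_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in read_positions:
        counts[pos // bin_size] += 1
    return counts


def copy_number(counts, ploidy=2):
    """Scale bin counts so that the median bin corresponds to the assumed ploidy."""
    baseline = median(counts)
    if baseline == 0:
        raise ValueError("median bin count is zero; coverage too low for this bin size")
    return [ploidy * c / baseline for c in counts]
```

In a highly amplified or aneuploid genome such as MCF7, the median bin may not be diploid, so the baseline would normally be calibrated against a matched normal or a known diploid region.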

For comparison, we obtained data for both breast cancer
cell lines run on Affymetrix 100K single nucleotide
polymorphism (SNP) chips and segmented with the Gain and
Loss Analysis of DNA (GLAD) algorithm; the microarray data
are available in the NCBI GEO database (accession GSE13696)
(http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE13696)
(34). Even with approximately onefold sequence
coverage, our results provided higher resolution than the
Affymetrix arrays and allowed copy number variation calls
to be made in many regions where no array probes exist.
A look at gross features showed good concordance between
the two platforms, including detection of previously described
high-level amplifications in MCF7 (28) and in HCC1954 (12)
(highlighted regions in Figure 2B). Notably, our
sequence-based approach provided higher dynamic range and
revealed multiple regions in both cell lines that have been
copied 50 to 100 times. As a result of saturation effects and
lower resolution, these regions are called with far lower copy
number on the Affymetrix arrays. There are a small number of
discordant aberrations, including regions on chromosomes 2
and 9 in the MCF7 genome, which we believe reflect biological
differences between different passages and/or sublines of
MCF7. Most other discordant events are likely attributable to
the increased coverage, resolution, and dynamic range of the
sequence-based assays.

In MCF7, a 20 kb segment on cytoband 20q13.31
showed the highest level of amplification with a copy
number count of 70. This region encompasses the BMP7
gene, a member of the transforming growth factor-beta
superfamily, and corresponds to the fusion of the BMP7
promoter upstream of the ZNF217 oncogene, which is
overexpressed in breast cancer (35). ZNF217 can attenuate
apoptotic signals resulting from telomere dysfunction and
may promote neoplastic transformation during later stages
of malignancy (36).

Figure 1 (A) Circular visualizations of the MCF7 and HCC1954 genomes obtained with Circos software (52). Chromosomes are
individually colored with centromeres in white. Copy number variation is plotted with gains in blue and losses in red. The colored
rearrangements depict breast cancer-specific somatic mutations from the combined fosmid-sized and 5 kb mate pair libraries. Green
lines denote intrachromosomal and purple lines denote interchromosomal rearrangements. (B) Venn diagrams comparing the
numbers of fosmid-sized and 5 kb mate pair rearrangements, PCR primer designs, PCR assays that produced breakpoint amplicon vs.
no amplification product, and rearrangements that are validated as breast cancer-specific mutation vs. normal structural variation in
the MCF7 and HCC1954 genomes.

In HCC1954, a 51 kb segment on cytoband 17q12
showed the highest level of amplification with a copy
number count of 117. This region encompasses HER2/neu
(also known as ERBB-2), which is known to be overexpressed
in this cell line. HER2 overexpression in breast
cancer is associated with an aggressive tumor phenotype,
increased disease recurrence, and overall worse prognosis.
HER2 overexpression serves not only as a prognostic
marker, but also as a drug target for the monoclonal
antibody trastuzumab. Also of interest, a 59 kb segment on
cytoband 11q13.2, encompassing the gene CCND1,
showed ninefold amplification. The CCND1 gene, a key
cell-cycle regulator, is often overexpressed in breast cancer
patients, and correlates with shorter relapse-free survival
times (37).

MCF7  and  HCC1954  exhibit  defects  in  the 
homologous  double-strand  break  repair  pathway 

In the MCF7 and HCC1954 breast cancer cell lines, we
identified rearrangements in genes that code for members of
protein complexes involved in DNA double-stranded break
repair (DSBR), raising the possibility that distinct defects in
DSBR genes may have contributed to different patterns of
genomic instability. For example, in MCF7 we identified the
gene-gene fusion of RAD51C exons 1-7 to the neuronal-
specific gene ATXN7 exons 6-13, resulting in an
expressed chimeric transcript. RAD51C is a paralog of
RAD51, a gene central to DNA DSBR. RAD51C is an
essential component of a complex reported to be involved in
resolving HJs (29) formed during DSBR (38) and, as such, is
integral to the maintenance of genomic stability. The
translocation we have identified eliminates the domain of
RAD51C that binds other family members such as RAD51D and
XRCC3 (39), possibly disrupting formation of the complex
responsible for resolving HJs.

Table 1 Validated chimeric protein fusions and gene truncations in MCF7 and HCC1954 breast cancer cell lines

| Cell line | Genomic changes | Chromosome locations | Genes affected | Effect on coding | Somatic changes reported in cancers | Validation method |
|---|---|---|---|---|---|---|
| MCF7 | Interchromosomal translocation | t(17;20)(q23.2;q13.13) | BCAS3, BCAS4 | Chimeric protein | Fusion is recurrently present in MCF7 breast and HCT116 colon cancer cell lines. | cDNA, genomic and published (9,20,28,32,53) |
| MCF7 | Interchromosomal translocation | t(3;17)(p14.1;q22) | ATXN7, RAD51C | Chimeric protein | Decreased expression of RAD51C found in the majority of breast cancer cell lines. | FISH, cDNA, genomic and published (28,32) |
| MCF7 | Intrachromosomal inversion | t(20;20)(q13.13;q13.13) | SULF2, ARFGEF2 | Chimeric protein | SULF2 is a known tumor suppressor. SULF2 siRNA silencing is tumorigenic in vivo. | cDNA, genomic and published (9,28,32) |
| MCF7 | Interchromosomal translocation | t(3;20)(p14.1;q13.13) | SULF2, PRICKLE2 | Chimeric protein | See above re: SULF2 function and phenotype. | Genomic and published (9,28,32) |
| MCF7 | Intrachromosomal indel | t(19;19)(p13.11;p13.11) | MYO9B, FCHO1 | Chimeric protein | MYO9B mutations associate with different inflammatory or autoimmune diseases. | Genomic and published (9,32) |
| MCF7 | Intrachromosomal indel | t(17;17)(q22;q23.1) | BC017255, TMEM49 | Chimeric protein | High-level amplification at TMEM49 induces high expression of miR-21, which targets PTEN and results in an aggressive breast cancer phenotype (54). | Genomic and published (9) |
| MCF7 | Intrachromosomal inversion | t(17;17)(q23.1;q23.1) | RPS6KB1, TMEM49 | Chimeric protein | See above re: TMEM49 function and phenotype. RPS6KB1 is amplified and overexpressed in 10-30% of primary breast cancers and cell lines. RPS6KB1 is regulated by the MTOR pathway, which regulates cell cycle, growth, and survival (55). | Genomic and published (9,32) |
| MCF7 | Intrachromosomal inversion | t(5;5)(q12.1;q12.1) | DEPDC1B, ELOVL7 | Chimeric protein | ELOVL7 is overexpressed in bladder, breast, colorectal, esophageal, gastric, and prostate cancers. High-fat diet promotes growth of in vivo tumors of ELOVL7-expressed prostate cancer (56). | cDNA, genomic and published (9,28,32) |
| MCF7 | Interchromosomal translocation | t(3;17)(p14.2;q22) | TEX14, PTPRG | Chimeric protein | PTPRG is a known tumor suppressor in kidney, lung and breast cancers. PTPRG has been shown to inhibit MCF7 anchorage-independent growth and reduce estrogenic response cell proliferation (57). | Genomic and published (9,32) |
| MCF7 | Interchromosomal translocation | t(17;20)(q24.3;q13.32) | ABCA5, PPP4R1L | Chimeric protein | Induction of ABCA5 correlates with differentiation state of human colon tumor. | Genomic and published (9) |
| MCF7 | Intrachromosomal inversion | t(X;X)(p22.2;p22.2) | CXorf15, SYAP1 | Chimeric protein | No known cancer phenotype. | Genomic and published (9,32) |
| MCF7 | Interchromosomal translocation | t(3;17)(p14.1;q23.2) | BRIP1 | Truncation | BRIP1 truncations confer a twofold increased risk of developing breast cancer. Truncation mutants block double stranded break repair. | Genomic and published (9) |
| HCC1954 | Interchromosomal translocation | t(5;8)(q23.1;q13.13) | EIF3E | Truncation | Truncation is tumorigenic in vivo. Decreased expression found in one-third of all human breast carcinomas. | cDNA, genomic and published (12,33) |
| HCC1954 | Interchromosomal translocation | t(5;8)(q35.3;q24.21) | NSD1 | Truncation | Fusion protein in acute myeloid leukemia. | FISH, cDNA, genomic and published (33) |
| HCC1954 | Interchromosomal translocation | t(5;8)(p15.33;q24.21) | CLPTM1L, PVT1 | Truncation | Amplification of PVT1 linked to pathophysiology of ovarian and breast cancers. | Genomic and published (33) |
| HCC1954 | Interchromosomal translocation | t(5;8)(p15.35.2;q24.21) | UIMC1 or RAP80 | Truncation | Recurrent RAP80 missense mutations identified in breast cancer patients (45,46). | Genomic |

Figure 2 (A) Arc visualizations of the largest MCF7 and HCC1954 breakpoint cliques and their association with copy number
amplification. Chromosome cytobands are shaded and labeled. The colored rearrangements depict breast tumor intrachromosomal
(green) and interchromosomal (purple) mutations. Copy number counts are plotted with gains in blue, losses in red, and normal diploid
in grey (count scales run from zero to a maximum of 60 for MCF7 and 15 for HCC1954). (B) MCF7 and HCC1954 log2 copy number plots of
Affymetrix 100K SNP chip arrays (34) (top) and Illumina mate pair mapped sequence counts (bottom); gains are plotted in blue, losses
in red, and normal diploid in grey. Highlighted regions correspond to the largest breakpoint cliques from A.

Also in MCF7, we identified a truncation of the BRIP1
gene, BRCA1-interacting protein-1. BRIP1 was originally
identified as a helicase-like protein that interacts directly with
BRCA1 and contributes to its DNA repair function. BRIP1
binds to the BRCT repeat in BRCA1. The C-terminus of
BRIP1 is critical for its interaction with BRCA1, and a
truncation mutant has been shown to block DSBR (40-42).
Clinically, germline truncation mutations of BRIP1 have been
identified in familial breast cancer without mutations of
BRCA1/2, and BRIP1 truncations confer a twofold increased
risk of developing breast cancer. We identified
a translocation that results in the loss of the last three exons
(exons 18-20); however, the fused DNA (3p14) downstream
of BRIP1 does not contain any exons or introns. The
truncation at exon 17 of BRIP1 would eliminate the C-terminal
third of BRIP1 and eliminate binding to BRCA1. However, it
is unclear at present whether the truncated mRNA would be
stable because there is no transcription stop site or polyA tail.

In HCC1954, we discovered a novel gene truncation of
UIMC1 (also referred to as the BRCA1-A complex subunit
RAP80). RAP80 has been extensively studied because of its
roles in localizing BRCA1 to DNA double-strand break sites,
regulating BRCA1-dependent DNA damage checkpoint
function, and as a potential tumor suppressor (43,44).
Whereas many RAP80 missense SNP mutations have been
identified in non-BRCA1/2 multiethnic breast cancer cases
(45,46), no truncating mutation of the RAP80 gene in breast
cancer has been previously published. Interestingly, Dr.
Xiaochun Yu has identified a truncating SNP mutation on
RAP80 cDNA (G1107A) in the ovarian adenocarcinoma cell
line TOV21G that results in a premature stop codon at
Trp369. This truncation product disrupts the RAP80
interaction with BRCA1 and fails to localize to nuclear foci
after DNA damage (47). The UIMC1 truncation we identified
cleaves the native transcript after exon 10 and results in loss
of the C-terminus exons 11-15, similarly eliminating DNA
recognition and binding capability.

Although our fosmid diTags and Illumina mate pairs did
not detect the previously published HCC1954 gene
truncation of MRE11A (33), we did confirm the existence of the
t(4;11)(q32;q21) genomic lesion involving MRE11A in
another study by using 2 kb Life Technologies SOLiD mate
pairs (unpublished data). MRE11A is involved in homologous
recombination, telomere length maintenance, and DNA
DSBR, and this truncation eliminates its DNA binding domain
in the HCC1954 breast cancer cell line.

Coinciding  occurrences  of  rearrangement 
clustering  and  amplification  point  to  similar 
histories  of  genomic  instability  in  MCF7  and 
HCC1954 

As evidenced by Figure 1A, the breakpoints in MCF7 and
HCC1954 are not evenly distributed across the genome.
A number of clusters of closely spaced breakpoints are
evident. To formally delineate clustered breakpoints from the
remainder, breakpoints within 2 Mb in MCF7 and 5 Mb in
HCC1954 were clustered. In each cell line, the cluster
containing the highest number of breakpoints was selected to
seed a connected graph where chromosome segments are
nodes and spanning breakpoints are edges. In MCF7, four
clusters emerged at cytobands 1p13.1-p21.1, 3p14.1-p14.2,
17q22-q24.3, and 20q12-q13.33. In HCC1954, five clusters
emerged at cytobands 5p15.3, 5q22.3-q23.2, 5q35.2-q35.3,
8q22.2-q24.22, and 11q13.2-q12.3. Moreover, the four
MCF7 and five HCC1954 clustered breakpoint locations
coincide exactly with high-level amplifications in their
respective genomes, indicating possible positive selection and
functional significance (Figure 2).
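The clustering and graph construction described above can be sketched as follows. This is an illustrative reading of the procedure, not the code used in this study: the single-linkage rule for grouping breakpoint ends, the tuple layout `(chrom1, pos1, chrom2, pos2)`, and the function names are assumptions. The connected set of clusters returned here corresponds to what the text calls a breakpoint clique.

```python
from collections import defaultdict, deque


def cluster_ends(rearrangements, max_gap):
    """Single-linkage clustering of breakpoint ends along each chromosome."""
    ends = sorted({(c, p) for r in rearrangements
                   for c, p in ((r[0], r[1]), (r[2], r[3]))})
    cluster_of, sizes, prev = {}, [], None
    for chrom, pos in ends:
        if prev is None or prev[0] != chrom or pos - prev[1] > max_gap:
            sizes.append(0)  # start a new cluster
        cluster_of[(chrom, pos)] = len(sizes) - 1
        sizes[-1] += 1
        prev = (chrom, pos)
    return cluster_of, sizes


def seeded_component(rearrangements, max_gap):
    """Clusters reachable from the largest cluster via spanning rearrangements."""
    cluster_of, sizes = cluster_ends(rearrangements, max_gap)
    if not sizes:
        return set(), cluster_of
    graph = defaultdict(set)  # nodes: clusters; edges: spanning rearrangements
    for c1, p1, c2, p2 in rearrangements:
        a, b = cluster_of[(c1, p1)], cluster_of[(c2, p2)]
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    seed = max(range(len(sizes)), key=sizes.__getitem__)
    seen, queue = {seed}, deque([seed])
    while queue:  # breadth-first traversal from the seed cluster
        node = queue.popleft()
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen, cluster_of
```

Under this reading, `max_gap` would be 2,000,000 for MCF7 and 5,000,000 for HCC1954, matching the distances given in the text.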

The amplification patterns found in MCF7 and HCC1954
are consistent with the complex firestorm pattern described
by Hicks et al., which is associated with breast cancer
prognostic markers and corresponds with poor patient
outcomes (48). Interestingly, the most frequently recurrent
locations of firestorm amplification identified by Hicks et al.
within the 243 breast tumors studied reside on chromosomal
arms 11q and 17q. These loci contain the genes CCND1 on 11q
and ERBB2 on 17q, noted previously to be highly amplified in
HCC1954, and may drive selection for these amplifying
mutations.

In both cell lines, the remaining nonclustered or dispersed
breakpoints were highly associated with LCRs. The
dispersed breakpoints in MCF7 show a 9.8-fold enrichment
for LCRs, and the trend is repeated in HCC1954 with a
9.1-fold enrichment for LCRs. LCR enrichment at dispersed
breakpoints is a characteristic previously described in MCF7
(28), and is observed again in HCC1954. This finding is
in contrast to the clustered breakpoints, which do not exhibit
enrichment for LCRs.
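A fold enrichment of this kind is the ratio of the observed fraction of breakpoints that overlap LCRs to the fraction expected by chance. The sketch below assumes the expected fraction is the share of the genome covered by LCRs (6% according to the Methods); the text does not state which background was actually used, and the counts in the usage note are hypothetical.

```python
def fold_enrichment(n_overlapping, n_total, background_fraction):
    """Observed overlap fraction divided by the fraction expected by chance."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 < background_fraction <= 1:
        raise ValueError("background_fraction must be in (0, 1]")
    return (n_overlapping / n_total) / background_fraction
```

For example, with a 6% background, 59 of 100 hypothetical breakpoints overlapping LCRs would give `fold_enrichment(59, 100, 0.06)`, a value close to 9.8.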

Discussion 

It is known that chromosomal rearrangements are highly
associated with repetitive sequences in genomic disorders
and cancer. Up to a quarter of entries in the Gross
Rearrangement Breakpoint Database
(http://www.uwcm.ac.uk/uwcm/mg/grabd) show the presence of
repetitive elements (49). The repetitive elements range in
size and may be as large as 6 kb in the case of long
interspersed nuclear elements, and they may cluster, creating
long stretches of nonunique sequence. Breakpoints that
overlap repetitive sequence elements may not be detected by
5 kb (or shorter-range) mate pair libraries. Even if the
breakpoint is detected, the nonunique sequence surrounding
the rearrangement may make validation by PCR challenging.
Large clone-sized inserts, such as fosmid diTags, overcome
this problem by spanning repetitive sequences and correctly
identifying aberrant fusions. For example, in our previous
study of MCF7 cells, we identified the expressed
DEPDC1B-ELOVL7 chimeric mRNA transcript, which is
formed by a 5q12.1 intrachromosomal inversion (28). This
breakpoint was detected by using fosmid diTags, but not
5 kb mate pairs, because of the presence of long and short
interspersed nuclear elements and microsatellites
surrounding the site of rearrangement.

In many cases, optimal PCR primer design is hindered by
the presence of repetitive sequence surrounding the join. This
is common when rearrangements are facilitated by
homologous recombination (50,51). Short repetitive elements or
longer segmental duplications (also referred to as LCRs) at
sites of rearrangement severely limit the number of unique
priming positions. Fosmid-sized inserts are able to span such
repetitive regions, thus providing a means of validating
breakpoints even in cases of PCR assay failure. For example,
there are two previously published gene truncations identified
by our fosmid diTag and 5 kb mate pair libraries that failed
breakpoint-spanning PCR assay confirmation. The first is the
t(5;8)(q35.3;q24.21) translocation in HCC1954 involving the
truncation of NSD1, a gene also found as a fusion protein in
myeloid leukemia (33). The second is the t(3;17)(p14.1;q23.2)
translocation in MCF7 involving the truncation of BRIP1, a
BRCA1-interacting protein that contributes to DNA repair (28).
Although these two gene truncations were cross-validated by
fosmid and 5 kb sized inserts, PCR assay across the
breakpoint resulted in amplification failure. In these cases,
breakpoint-spanning primer design was hindered by the
presence of interspersed nuclear elements and long terminal
repeats across the aberrant joins.

We showed that fosmid-sized inserts are adept at spanning
repetitive sequences known to exist at sites of gross
rearrangement and LCRs associated with homologous
recombination. By combining fosmid diTag and 5 kb Illumina
mate pair libraries, we were able to detect and validate
aberrant fusions involving repetitive genomic sequence
where detection by shorter end sequence profiles alone or
validation by breakpoint-spanning PCR assays failed. In
addition, we observed that those rearrangements detected
by both insert size ranges exhibit threefold enrichment for
cancer-specific somatic mutation and twofold reduction in
false-positive detection when compared to the 5 kb mate
pairs alone.

For those breast cancer-specific somatic mutations
involving genes, we queried the transcriptome fusion and
truncation literature to corroborate our findings and assess
the extent to which our combined fosmid diTag and 5 kb mate
pair libraries rediscovered known chimeric transcripts
reported in MCF7 and HCC1954. We identified genomic
alterations corresponding to up to approximately half of
the published MCF7 and HCC1954 chimeric mRNA
transcripts, but it is difficult to assess the lower bound of our
sensitivity because it is unclear whether the undetected
transcript mutations are due to trans-splicing or similar
posttranscriptional modifications.

We integrated read density and breakpoint information
from mapped fosmid diTags and 5 kb mate pairs to
accurately identify distinct copy number variation in MCF7 and
HCC1954. We discovered distinct driver oncogenes
associated with high copy number amplifications in MCF7 and
HCC1954. The distinct structural mutability profiles of
MCF7 and HCC1954 correlate with their phenotypic
differences. Amplified chromosomal segments, breakpoint
clusters, and affected genes are located at different positions
across the MCF7 and HCC1954 genomes, and correspond
to overexpression of different oncogenes, silencing of
diverse tumor suppressors, and distinct defects in DNA
repair machinery responsible for homology-driven repair
of double-stranded DNA breaks. It is intriguing that, in
conjunction with mutations in the same DNA repair pathway,
we also find similar patterns of structural mutability in the two
cell lines. Both have clustered and dispersed breakpoints;
both exhibit clustered breakpoints in regions of high copy
number amplification and dispersed breakpoints that are
enriched for the presence of LCRs.